Abstract

It is well-known that Consensus, a fundamental problem of fault-tolerant distributed
computing, cannot be solved in asynchronous systems with crash failures.

This  impossibility  result  stems  from  the  lack  of  reliable  failure  detection  in  such 
systems. To circumvent such impossibility results, we introduce the concept of unreliable
failure detectors that can make mistakes, and study the problem of using
them  to  solve  Consensus. 

We characterize unreliable failure detectors by two types of properties: completeness
and accuracy. Informally, completeness requires that the failure detector
eventually  suspects  every  process  that  actually  crashes,  while  accuracy  restricts  the 
mistakes  that  it  can  make.  We  define  a  hierarchy  of  failure  detectors  based  on  the 
strength  of  their  accuracy.  We  determine  which  failure  detectors  in  this  hierarchy 
can  be  used  to  solve  Consensus  despite  any  number  of  crashes,  and  which  ones 
require  a  majority  of  correct  processes. 

We  show  that  Consensus  can  be  solved  with  “weak”  failure  detectors,  i.e.,  failure 
detectors  that  make  an  infinite  number  of  mistakes.  This  leads  to  the  following 
question:  What  is  the  “weakest”  failure  detector  for  solving  Consensus?  In  a 
companion paper, we show that $\Diamond\mathcal{W}$, one of the failure detectors that we consider
here,  is  the  weakest  failure  detector  for  solving  Consensus  in  asynchronous  systems. 

In  this  paper,  we  show  that  Consensus  and  Atomic  Broadcast  are  reducible 
to  each  other  in  asynchronous  systems.  Thus,  all  our  results  apply  to  Atomic 
Broadcast  as  well. 

Research supported by NSF grants CCR-8901780 and CCR-9102231, DARPA/NASA Ames Grant
NAG-2-593, and in part by grants from IBM and Siemens Corp. A preliminary version of this paper
appeared in Proceedings of the Tenth ACM Symposium on Principles of Distributed Computing, pages
325-340. ACM Press, August 1991.

*Also  supported  by  an  IBM  graduate  fellowship. 
1  Introduction

The design and verification of fault-tolerant distributed applications is widely viewed
as  a  complex  endeavour.  In  recent  years,  several  paradigms  have  been  identified  which 
simplify  this  task.  Key  among  these  are  Consensus  and  Atomic  Broadcast.  Roughly 
speaking,  Consensus  allows  processes  to  reach  a  common  decision,  which  depends  on  their 
initial  inputs,  despite  failures.  Consensus  algorithms  can  be  used  to  solve  many  problems 
that  arise  in  practice,  such  as  electing  a  leader  or  agreeing  on  the  value  of  a  replicated 
sensor.  Atomic  Broadcast  allows  processes  to  reliably  broadcast  messages,  so  that  they 
agree  on  the  set  of  messages  they  deliver  and  the  order  of  message  deliveries.  Applications 
based  on  these  paradigms  include  SIFT  [WLG+78],  State  Machines  [Lam78,  Sch90],  Isis 
[BJ87,  BCJ+90],  Psync  [PBS89],  Amoeba  [Mul87],  Delta-4  [Pow91],  Transis  [ADKM91], 
HAS  [Cri87],  FAA  [CDD90],  and  Atomic  Commitment. 

Given their wide applicability, Consensus and Atomic Broadcast have been extensively
studied by both theoretical and experimental researchers for over a decade. In this
paper, we focus on solutions to Consensus and Atomic Broadcast in the asynchronous
model of distributed computing. Informally, a distributed system is asynchronous if
there is no bound on message delay, clock drift, or the time necessary to execute a step.
Thus, to say that a system is asynchronous is to make no timing assumptions whatsoever.
This model is attractive and has recently gained much currency for several reasons:
It  has  simple  semantics;  applications  programmed  on  the  basis  of  this  model  are  easier 
to  port  than  those  incorporating  specific  timing  assumptions;  and  in  practice,  variable 
or  unexpected  workloads  are  sources  of  asynchrony — thus  synchrony  assumptions  are  at 
best  probabilistic. 

Although the asynchronous model of computation is attractive for the reasons outlined
above, it is well known that Consensus and Atomic Broadcast cannot be solved
deterministically in an asynchronous system that is subject to even a single crash failure
[FLP85, DDS87].¹ Essentially, the impossibility results for Consensus and Atomic
Broadcast stem from the inherent difficulty of determining whether a process has actually
crashed or is only “very slow”.

To  circumvent  these  impossibility  results,  previous  research  focused  on  the  use  of 
randomization techniques [CD89], the definition of some weaker problems and their solutions
[DLP+86, ABD+87, BW87, BMZ88], or the study of several models of partial
synchrony  [DDS87,  DLS88].  Nevertheless,  the  impossibility  of  deterministic  solutions 
to  many  agreement  problems  (such  as  Consensus  and  Atomic  Broadcast)  remains  a 
major  obstacle  to  the  use  of  the  asynchronous  model  of  computation  for  fault-tolerant 
distributed  computing. 

In  this  paper,  we  propose  an  alternative  approach  to  circumvent  such  impossibility 
results,  and  to  broaden  the  applicability  of  the  asynchronous  model  of  computation. 
Since impossibility results for asynchronous systems stem from the inherent difficulty of
determining whether a process has actually crashed or is only “very slow”, we propose
to augment the asynchronous model of computation with a model of an external failure
detection mechanism that can make mistakes. In particular, we model the concept of
unreliable failure detectors for systems with crash failures.

¹ Roughly speaking, a crash failure occurs when a process that has been executing correctly stops
prematurely. Once a process crashes, it does not recover.

We  consider  distributed  failure  detectors:  each  process  has  access  to  a  local  failure 
detector  module.  Each  local  module  monitors  a  subset  of  the  processes  in  the  system, 
and  maintains  a  list  of  those  that  it  currently  suspects  to  have  crashed.  We  assume  that 
each  failure  detector  module  can  make  mistakes  by  erroneously  adding  processes  to  its 
list of suspects: i.e., it can suspect that a process p has crashed even though p is still
running.  If  this  module  later  believes  that  suspecting  p  was  a  mistake,  it  can  remove  p 
from  its  list.  Thus,  each  module  may  repeatedly  add  and  remove  processes  from  its  list 
of  suspects.  Furthermore,  at  any  given  time  the  failure  detector  modules  at  two  different 
processes  may  have  different  lists  of  suspects. 

It  is  important  to  note  that  the  mistakes  made  by  an  unreliable  failure  detector 
should  not  prevent  any  correct  process  from  behaving  according  to  specification  even  if 
that  process  is  (erroneously)  suspected  to  have  crashed  by  all  the  other  processes.  For 
example,  consider  an  algorithm  that  uses  a  failure  detector  to  solve  Atomic  Broadcast 
in an asynchronous system. Suppose all the failure detector modules wrongly (and permanently)
suspect that correct process p has crashed. The Atomic Broadcast algorithm
must still ensure that p delivers the same set of messages, in the same order, as all the
other correct processes. Furthermore, if p broadcasts a message m, all correct processes
must deliver m.²

We define failure detectors in terms of abstract properties as opposed to giving specific
implementations; the hardware or software implementation of failure detectors is
not the concern of this paper. This approach allows us to design applications and prove
their correctness relying solely on these properties, without referring to low-level network
parameters (such as the exact duration of time-outs that are used to implement
failure detectors). This makes the presentation of applications and their proof of correctness
more modular. Our approach is well-suited to model many existing systems
that decouple the design of fault-tolerant applications from the underlying failure detection
mechanisms, such as the Isis Toolkit [BCJ+90] for asynchronous fault-tolerant
distributed computing.

We characterize a failure detector by specifying the completeness property and accuracy
property that it must satisfy. Informally, completeness requires that the failure
detector eventually suspects every process that actually crashes,³ while accuracy restricts
the mistakes that a failure detector can make. We define two completeness and four accuracy
properties, which give rise to eight failure detectors, and consider the problem
of solving Consensus using each one of them.⁴

² A different approach was taken by the Isis system [RB91]: a correct process that is wrongly suspected
to have crashed is forced to leave the system. In other words, the Isis failure detector forces the system
to conform to its view. To applications, such a failure detector makes no mistakes. For a more detailed
discussion on this, see Section 8.3.

³ In this introduction, we say that the failure detector suspects that a process p has crashed if any
local failure detector module suspects that p has crashed.

To  do  so,  we  first  introduce  the  concept  of  “reducibility”  among  failure  detectors. 
Informally, a failure detector $\mathcal{D}'$ is reducible to failure detector $\mathcal{D}$ if there is a distributed
algorithm that can use $\mathcal{D}$ to emulate $\mathcal{D}'$. Given this reduction algorithm, anything that
can be done using failure detector $\mathcal{D}'$ can be done using $\mathcal{D}$ instead. Two failure detectors
are  equivalent  if  they  are  reducible  to  each  other.  Using  this  concept,  we  partition  our 
eight  failure  detectors  into  four  equivalence  classes,  and  consider  how  to  solve  Consensus 
for  each  class. 

We  show  that  only  four  of  our  eight  failure  detectors  can  be  used  to  solve  Consensus 
in  systems  in  which  any  number  of  processes  may  crash.  However,  if  we  assume  that  a 
majority  of  the  processes  do  not  crash,  then  any  of  our  eight  failure  detectors  can  be 
used  to  solve  Consensus.  In  order  to  better  understand  where  the  majority  requirement 
becomes  necessary,  we  study  an  infinite  hierarchy  of  failure  detectors  that  contains  the 
eight  failure  detectors  mentioned  above,  and  show  exactly  where  in  this  hierarchy  the 
majority  requirement  becomes  necessary. 

Of special interest is $\Diamond\mathcal{W}$, the weakest failure detector considered in this paper.
Informally, $\Diamond\mathcal{W}$ satisfies the following two properties:

•  Completeness: There is a time after which every process that crashes is always
suspected by some correct process.

•  Accuracy: There is a time after which some correct process is never suspected by
any correct process.

The failure detector $\Diamond\mathcal{W}$ can make an infinite number of mistakes: Each local failure
detector module of $\Diamond\mathcal{W}$ can repeatedly add and then remove correct processes from its
list  of  suspects  (this  reflects  the  inherent  difficulty  of  determining  whether  a  process  or 
a  link  is  just  slow  or  whether  it  has  crashed).  Moreover,  some  correct  processes  may  be 
erroneously  suspected  to  have  crashed  by  all  the  other  processes  throughout  the  entire 
execution. 

The two properties of $\Diamond\mathcal{W}$ state that eventually something must hold forever; this
may appear too strong a requirement to implement in practice. However, when solving a
problem that “terminates”, such as Consensus, it is not really required that the properties
hold forever, but merely that they hold for a sufficiently long time, i.e., long enough for
the algorithm that uses the failure detector to achieve its goal. For instance, in practice
our algorithm that solves Consensus using $\Diamond\mathcal{W}$ only needs the two properties of $\Diamond\mathcal{W}$
to hold for a relatively short period of time. However, in an asynchronous system it is
not possible to quantify “sufficiently long”, since even a single process step or a single
message transmission is allowed to take an arbitrarily long amount of time. Thus, it is
convenient to state the properties of $\Diamond\mathcal{W}$ in the stronger form given above.

⁴ We later show that Consensus and Atomic Broadcast are equivalent in asynchronous systems: any
Consensus algorithm can be transformed into an Atomic Broadcast algorithm and vice versa. Thus, we
can focus on Consensus since all our results will automatically apply to Atomic Broadcast as well.

Another advantage of using $\Diamond\mathcal{W}$ (as opposed to stronger failure detectors) is the following.
Consider an application that relies on $\Diamond\mathcal{W}$ for its correctness. If this application
is run in a system in which the failure detector “malfunctions” and fails to meet the
specification of $\Diamond\mathcal{W}$, then we may lose the liveness properties of the application, but its
safety properties will never be violated.

The failure detector abstraction is a clean extension to the asynchronous model of
computation that allows us to solve many problems that are otherwise unsolvable. Naturally,
the question arises of how to support such an abstraction in an actual system.
Since we specify failure detectors in terms of abstract properties, we are not committed
to a particular implementation. For instance, one could envision specialised hardware
to support this abstraction. However, most implementations of failure detectors are
based on time-out mechanisms. For the purpose of illustration, we now outline one such
implementation of $\Diamond\mathcal{W}$.

Informally, if a process times out on some process q, it adds q to its list of suspects,
and it broadcasts a message to all processes (including q) with this information. Any
process that receives this broadcast adds q to its list of suspects. If q has not crashed, it
broadcasts a refutation. If a process receives q's refutation, it removes q from its list of
suspects.
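
The exchange just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the paper: message delivery is simulated synchronously by a hypothetical `broadcast` helper, a crashed process is modelled by leaving its module out of the delivery list, and the names `SuspectList`, `on_timeout` and `on_message` are invented for the example.

```python
class SuspectList:
    """Local failure detector module of one process (illustrative sketch)."""

    def __init__(self, owner):
        self.owner = owner
        self.suspects = set()

    def on_timeout(self, q):
        """Local time-out on q: suspect q and return the message to broadcast."""
        self.suspects.add(q)
        return ("suspect", q)

    def on_message(self, msg):
        """Handle one broadcast message; return a reply to broadcast, or None."""
        kind, q = msg
        if kind == "suspect":
            if q == self.owner:
                # This process is still running, so it refutes the suspicion.
                return ("refute", q)
            self.suspects.add(q)
        elif kind == "refute":
            self.suspects.discard(q)
        return None


def broadcast(live_modules, msg):
    """Deliver msg to every live module, then deliver any replies (assumed helper)."""
    pending = [msg]
    while pending:
        current = pending.pop(0)
        for module in live_modules:
            reply = module.on_message(current)
            if reply is not None:
                pending.append(reply)


p, q, r = SuspectList("p"), SuspectList("q"), SuspectList("r")

# q is alive: the premature suspicion is refuted and removed everywhere.
broadcast([p, q, r], p.on_timeout("q"))
print(p.suspects, r.suspects)

# q has crashed: no refutation arrives, so q stays suspected.
broadcast([p, r], p.on_timeout("q"))
print(p.suspects, r.suspects)
```

The sketch deliberately ignores timing: in a real system the refutation may arrive arbitrarily late, which is exactly why premature time-outs can recur.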

In the purely asynchronous system, this scheme does not implement $\Diamond\mathcal{W}$:⁵ an unbounded
sequence of premature time-outs (with corresponding refutations) may cause
every correct process to be repeatedly added and then removed from every correct process’
list of suspects, thereby violating the accuracy property of $\Diamond\mathcal{W}$. Nevertheless, in
many practical systems, one can choose the time-out periods so that eventually there
are no premature time-outs on at least one correct process p. This gives the accuracy
property of $\Diamond\mathcal{W}$: there is a time after which p is permanently removed from all the lists
of suspects. Recall that, in practice, it is not necessary for this to hold permanently; it
is sufficient that it holds “long enough” for the application using the failure detector to
complete its task. Accordingly, it is not necessary for the premature time-outs on p to
cease permanently: it is sufficient that they cease for “long enough”.

Having made the point that $\Diamond\mathcal{W}$ can be implemented in practical systems using
time-outs,  we  reiterate  that  all  reasoning  about  failure  detectors  (and  algorithms  that 
use  them)  should  be  done  in  terms  of  their  abstract  properties  and  not  in  terms  of  any 
particular  implementation.  This  is  an  important  feature  of  this  approach,  and  the  reader 
should  refrain  from  thinking  of  failure  detectors  in  terms  of  specific  time-out  mechanisms. 

The failure detection information provided by $\Diamond\mathcal{W}$, the weakest failure detector considered
in this paper, is sufficient to solve Consensus. But is it necessary? In other words,
is it possible to solve Consensus with a failure detector that provides less information
about failures than $\Diamond\mathcal{W}$? Indeed, what is the “weakest” failure detector for solving
Consensus? In [CHT92], we show that $\Diamond\mathcal{W}$ is the weakest failure detector that can be
used to solve Consensus in asynchronous systems (with a majority of correct processes).
More precisely, we show how to emulate $\Diamond\mathcal{W}$ using any failure detector $\mathcal{D}$ that can be
used to solve Consensus. Thus, in a precise sense, $\Diamond\mathcal{W}$ is necessary and sufficient for
solving Consensus in asynchronous systems (with a majority of correct processes). This
result is further evidence of the importance of $\Diamond\mathcal{W}$ for fault-tolerance in asynchronous
distributed computing.

⁵ Indeed, no scheme could implement $\Diamond\mathcal{W}$ in the purely asynchronous system: as we show in Section
6.2, such an implementation could be used to solve Consensus in such a system, contradicting the
impossibility result of [FLP85].

In  our  discussion  so  far,  we  focused  on  the  Consensus  problem.  In  Section  7,  we 
show  that  Consensus  is  equivalent  to  Atomic  Broadcast  in  asynchronous  systems  with 
crash failures. This is shown by reducing each problem to the other.⁶ In other words, a
solution  for  one  automatically  yields  a  solution  for  the  other.  Both  reductions  apply  to 
any  asynchronous  system  (in  particular,  they  do  not  require  the  assumption  of  a  failure 
detector).  Thus,  Atomic  Broadcast  can  be  solved  using  the  unreliable  failure  detectors 
described in this paper. Furthermore, $\Diamond\mathcal{W}$ is the weakest failure detector that can be
used  to  solve  Atomic  Broadcast. 

A  different  tack  on  circumventing  the  unsolvability  of  Consensus  is  pursued  in  [DDS87] 
and  [DLS88].  The  approach  of  those  papers  is  based  on  the  observation  that  between 
the  completely  synchronous  and  completely  asynchronous  models  of  distributed  systems 
there  lie  a  variety  of  intermediate  partially  synchronous  models.  In  particular,  those  two 
papers  consider  34  different  models  of  partial  synchrony  and  for  each  model  determine 
whether  or  not  Consensus  can  be  solved.  In  this  paper,  we  argue  that  partial  synchrony 
assumptions  can  be  encapsulated  in  the  unreliability  of  the  failure  detector.  For  example, 
we show how to implement one of our failure detectors (which is stronger than $\Diamond\mathcal{W}$) in
the  models  of  partial  synchrony  considered  in  [DLS88].  This  immediately  implies  that 
Consensus  and  Atomic  Broadcast  can  be  solved  in  these  models.  Thus,  our  approach 
can be used to unify several seemingly unrelated models of partial synchrony.⁷

As  we  argued  earlier,  using  the  asynchronous  model  of  computation  is  highly  desirable 
in  many  applications:  it  results  in  code  that  is  simple,  portable  and  robust.  However, 
the  fact  that  fundamental  problems  such  as  Consensus  and  Atomic  Broadcast  have  no 
(deterministic)  solutions  in  this  model  is  a  major  obstacle  to  its  use  in  fault-tolerant 
distributed  computing.  Our  model  of  unreliable  failure  detectors  provides  a  natural  and 
simple  extension  of  the  asynchronous  model  of  computation,  in  which  Consensus  and 
Atomic  Broadcast  can  be  solved  deterministically.  Thus,  this  extended  model  retains 
the  advantages  of  asynchrony  without  inheriting  its  disadvantages.  We  believe  that 
this  approach  is  an  important  contribution  towards  bridging  the  gap  between  known 
theoretical  impossibility  results  and  the  need  for  fault-tolerant  solutions  in  real  systems. 

The  remainder  of  this  paper  is  organised  as  follows.  In  Section  2,  we  describe  our 
model  and  introduce  eight  failure  detectors  in  terms  of  their  abstract  properties.  In 
Section  3,  we  show  that  these  eight  failure  detectors  fall  into  four  equivalence  classes — 
this allows us to focus on four failure detectors rather than eight. In Section 4, we
present Reliable Broadcast, a communication primitive that several of our algorithms
use. In Section 5, we define the Consensus problem. In Section 6, we show how to
solve Consensus for each one of the four equivalence classes of failure detectors. In
Section 7, we show that Consensus and Atomic Broadcast are equivalent to each other
in asynchronous systems. In Section 8, we discuss related work, and in particular, we
describe an implementation of an unreliable failure detector that is more powerful than
$\Diamond\mathcal{W}$, in several models of partial synchrony. In the Appendix we define an infinite
hierarchy of failure detectors, and determine exactly where in this hierarchy a majority
of correct processes is required to solve Consensus.

⁶ They are actually equivalent even in asynchronous systems with arbitrary, i.e., “Byzantine”, failures.
However, that reduction is more complex and is omitted from this paper.

⁷ For a more detailed discussion on this, see Section 8.

2  The  model 

We  consider  asynchronous  distributed  systems  in  which  there  is  no  bound  on  message 
delay,  clock  drift,  or  the  time  necessary  to  execute  a  step.  Our  model  of  asynchronous 
computation  with  failure  detection  is  patterned  after  the  one  in  [FLP85].  The  system 
consists of a set of n processes, $\Pi = \{p_1, p_2, \ldots, p_n\}$. Every pair of processes is connected
by  a  reliable  communication  channel. 

To  simplify  the  presentation  of  our  model,  we  assume  the  existence  of  a  discrete  global 
clock.  This  is  merely  a  fictional  device:  the  processes  do  not  have  access  to  it.  We  take 
the range $\mathcal{T}$ of the clock’s ticks to be the set of natural numbers.

2.1  Failures  and  failure  patterns 

Processes can fail by crashing, i.e., by prematurely halting. A failure pattern F is a
function from $\mathcal{T}$ to $2^{\Pi}$, where F(t) denotes the set of processes that have crashed through
time t. Once a process crashes, it does not “recover”, i.e., $\forall t : F(t) \subseteq F(t+1)$. We
define $crashed(F) = \bigcup_{t \in \mathcal{T}} F(t)$ and $correct(F) = \Pi - crashed(F)$. If $p \in crashed(F)$ we
say p crashes in F and if $p \in correct(F)$ we say p is correct in F. We consider only
failure patterns F such that at least one process is correct, i.e., $correct(F) \neq \emptyset$.
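
As a concrete illustration of these definitions, the following Python sketch represents a failure pattern as a function of time. It is only a sketch under stated assumptions: the process names, the `crash_times` dictionary and the finite `horizon` (a stand-in for the infinite clock range) are all hypothetical and chosen for the example.

```python
PROCESSES = frozenset({"p1", "p2", "p3"})


def make_failure_pattern(crash_times):
    """Build F, where F(t) is the set of processes that have crashed through time t.

    crash_times maps a process to the (assumed) time at which it crashes;
    processes that never crash are simply absent from the mapping.
    """
    def F(t):
        return frozenset(p for p, tc in crash_times.items() if tc <= t)
    return F


def crashed(F, horizon):
    """Union of F(t) for t = 0..horizon (finite stand-in for all clock ticks)."""
    result = frozenset()
    for t in range(horizon + 1):
        result |= F(t)
    return result


def correct(F, horizon, processes=PROCESSES):
    """Processes that never crash within the horizon."""
    return processes - crashed(F, horizon)


F = make_failure_pattern({"p2": 3})
# Crashes are permanent, so F is monotone: F(t) is a subset of F(t + 1).
print(all(F(t) <= F(t + 1) for t in range(10)))
print(sorted(correct(F, horizon=10)))
```

Because a crash time is fixed once and never undone, monotonicity of F holds by construction in this representation.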

2.2  Failure  detectors 

Each failure detector module outputs the set of processes that it currently suspects to
have crashed.⁸ A failure detector history H is a function from $\Pi \times \mathcal{T}$ to $2^{\Pi}$. H(p, t) is the
value of the failure detector module of process p at time t. If $q \in H(p, t)$, we say that p
suspects q at time t in H. We omit references to H when it is obvious from the context.
Note that the failure detector modules of two different processes need not agree on the
list of processes that are suspected to have crashed, i.e., if $p \neq q$ then $H(p, t) \neq H(q, t)$
is possible.

⁸ In [CHT92] we study a more general class of failure detectors: their modules can output values from
an arbitrary range.

Informally, a failure detector $\mathcal{D}$ provides (possibly incorrect) information about the
failure pattern F that occurs in an execution. Formally, failure detector $\mathcal{D}$ is a function
that maps each failure pattern F to a set of failure detector histories $\mathcal{D}(F)$. This is the
set of all failure detector histories that could occur in executions with failure pattern F
and failure detector $\mathcal{D}$.⁹

In  this  paper,  we  do  not  define  failure  detectors  in  terms  of  specific  implementations. 
Such  implementations  would  have  to  refer  to  low-level  network  parameters,  such  as  the 
network  topology,  the  message  delays,  and  the  accuracy  of  the  local  clocks.  To  avoid  this 
problem,  we  specify  a  failure  detector  in  terms  of  two  abstract  properties  that  it  must 
satisfy:  completeness  and  accuracy.  This  allows  us  to  design  applications  and  prove  their 
correctness  relying  solely  on  these  properties. 

2.3  Failure  detector  properties 

2.3.1  Completeness 

We  consider  two  completeness  properties: 

•  Strong completeness: Eventually every process that crashes is permanently suspected
by every correct process. Formally, $\mathcal{D}$ satisfies strong completeness if:

\[ \forall F, \forall H \in \mathcal{D}(F), \exists t \in \mathcal{T}, \forall p \in crashed(F), \forall q \in correct(F), \forall t' \geq t : p \in H(q, t') \]

•  Weak completeness: Eventually every process that crashes is permanently suspected
by some correct process. Formally, $\mathcal{D}$ satisfies weak completeness if:

\[ \forall F, \forall H \in \mathcal{D}(F), \exists t \in \mathcal{T}, \forall p \in crashed(F), \exists q \in correct(F), \forall t' \geq t : p \in H(q, t') \]
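
The difference between the two properties lies only in the quantifier over the suspecting process q. The Python sketch below makes this explicit for a single history H. It rests on assumptions that are not in the paper: H is given as a function `H(q, t)` returning a set, the candidate stabilisation time `t` is supplied by the caller, and "permanently" is approximated by checking up to a finite `horizon`.

```python
def strong_completeness_from(H, crashed, correct, t, horizon):
    """Every crashed p is suspected by EVERY correct q at all times t..horizon."""
    return all(p in H(q, tp)
               for p in crashed
               for q in correct
               for tp in range(t, horizon + 1))


def weak_completeness_from(H, crashed, correct, t, horizon):
    """Every crashed p is suspected by SOME correct q at all times t..horizon."""
    return all(any(all(p in H(q, tp) for tp in range(t, horizon + 1))
                   for q in correct)
               for p in crashed)


# Hypothetical history: p3 crashes, and only p1 suspects it, from time 2 onward.
def H(q, t):
    return {"p3"} if q == "p1" and t >= 2 else set()


crashed_set, correct_set = {"p3"}, {"p1", "p2"}
print(weak_completeness_from(H, crashed_set, correct_set, t=2, horizon=10))
print(strong_completeness_from(H, crashed_set, correct_set, t=2, horizon=10))
```

In this example the history satisfies the weak check but not the strong one, since p2 never suspects p3.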

However,  completeness  by  itself  is  not  a  useful  property.  To  see  this,  consider  a  failure 
detector  which  causes  every  process  to  permanently  suspect  every  other  process  in  the 
system. Such a failure detector trivially satisfies strong completeness but is clearly useless
since  it  provides  no  information  about  failures.  To  be  useful,  a  failure  detector  must  also 
satisfy  some  accuracy  property  that  restricts  the  mistakes  that  it  can  make.  We  now 
consider  such  properties. 

2.3.2  Accuracy 

Consider  the  following  two  accuracy  properties: 

•  Strong accuracy: No process is suspected before it crashes. Formally, $\mathcal{D}$ satisfies
strong accuracy if:

\[ \forall F, \forall H \in \mathcal{D}(F), \forall t \in \mathcal{T}, \forall p, q \in \Pi - F(t) : p \notin H(q, t) \]

Since it is difficult (if not impossible) to achieve strong accuracy in many practical
systems, we also define:

•  Weak accuracy: Some correct process is never suspected. Formally, $\mathcal{D}$ satisfies
weak accuracy if:

\[ \forall F, \forall H \in \mathcal{D}(F), \exists p \in correct(F), \forall t \in \mathcal{T}, \forall q \in \Pi - F(t) : p \notin H(q, t) \]

⁹ In general, there are many executions with the same failure pattern F (e.g., these executions may
differ by the pattern of their message exchange). For each such execution, $\mathcal{D}$ may give a different failure
detector history.

Even  weak  accuracy  guarantees  that  at  least  one  correct  process  is  never  suspected. 
Since  this  type  of  accuracy  may  be  difficult  to  achieve,  we  consider  failure  detectors 
that  may  suspect  every  process  at  one  time  or  another.  Informally,  we  only  require  that 
strong  accuracy  or  weak  accuracy  are  eventually  satisfied.  The  resulting  properties  are 
called  eventual  strong  accuracy  and  eventual  weak  accuracy ,  respectively. 

For example, eventual strong accuracy requires that there is a time after which strong
accuracy holds. Formally, $\mathcal{D}$ satisfies eventual strong accuracy if:

\[ \forall F, \forall H \in \mathcal{D}(F), \exists t \in \mathcal{T}, \forall t' \geq t, \forall p, q \in \Pi - F(t') : p \notin H(q, t') \]

An observation is now in order. Since all faulty processes will crash after some finite
time, we have:

\[ \forall F, \exists t \in \mathcal{T}, \forall t' \geq t : \Pi - F(t') = correct(F) \]

Thus, an equivalent and simpler formulation of eventual strong accuracy is:

•  Eventual strong accuracy: There is a time after which correct processes are not
suspected by any correct process. Formally, $\mathcal{D}$ satisfies eventual strong accuracy
if:

\[ \forall F, \forall H \in \mathcal{D}(F), \exists t \in \mathcal{T}, \forall t' \geq t, \forall p, q \in correct(F) : p \notin H(q, t') \]

Similarly, we specify eventual weak accuracy as follows:

•  Eventual weak accuracy: There is a time after which some correct process is never
suspected by any correct process. Formally, $\mathcal{D}$ satisfies eventual weak accuracy if:

\[ \forall F, \forall H \in \mathcal{D}(F), \exists t \in \mathcal{T}, \exists p \in correct(F), \forall t' \geq t, \forall q \in correct(F) : p \notin H(q, t') \]

We  will  refer  to  eventual  strong  accuracy  and  eventual  weak  accuracy  as  eventual 
accuracy properties, and strong accuracy and weak accuracy as perpetual accuracy properties.
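
The four accuracy properties can be compared side by side with a small Python sketch. As before, this is an illustration under assumptions that are not part of the paper: a history is a function `H(q, t)` returning a set of suspects, a failure pattern is a function `F(t)`, the eventual properties take a caller-supplied candidate time `t`, and all checks stop at a finite `horizon`.

```python
def strong_accuracy(H, F, processes, horizon):
    """No process is suspected before it crashes (times 0..horizon)."""
    return all(p not in H(q, t)
               for t in range(horizon + 1)
               for q in processes - F(t)
               for p in processes - F(t))


def weak_accuracy(H, F, processes, correct, horizon):
    """Some correct process is never suspected by a process that is still alive."""
    return any(all(p not in H(q, t)
                   for t in range(horizon + 1)
                   for q in processes - F(t))
               for p in correct)


def eventual_strong_accuracy_from(H, correct, t, horizon):
    """From time t on, no correct process is suspected by any correct process."""
    return all(p not in H(q, tp)
               for tp in range(t, horizon + 1)
               for q in correct
               for p in correct)


def eventual_weak_accuracy_from(H, correct, t, horizon):
    """From time t on, some correct process is suspected by no correct process."""
    return any(all(p not in H(q, tp)
                   for tp in range(t, horizon + 1)
                   for q in correct)
               for p in correct)


# Hypothetical run: nobody crashes, but b wrongly suspects a before time 2.
PROCESSES = frozenset({"a", "b", "c"})


def F(t):
    return frozenset()


def H(q, t):
    return {"a"} if q == "b" and t < 2 else set()


print(strong_accuracy(H, F, PROCESSES, 5))                   # one early mistake breaks it
print(weak_accuracy(H, F, PROCESSES, PROCESSES, 5))          # b and c are never suspected
print(eventual_strong_accuracy_from(H, PROCESSES, 2, 5))     # mistakes stop at time 2
```

The example shows why the perpetual properties are the harder ones to meet: a single premature suspicion violates strong accuracy, while the eventual variants tolerate it.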

Figure 1: Some failure detector specifications based on accuracy and completeness.

2.4  Some  failure  detector  definitions 

A  failure  detector  can  be  specified  by  stating  the  completeness  property  and  the  accuracy 
property  that  it  must  satisfy.  Combining  the  two  completeness  properties  with  the  four 
accuracy  properties  that  we  defined  in  the  previous  section  gives  rise  to  the  eight  different 
failure  detectors  defined  in  Figure  1.  For  example,  we  say  that  a  failure  detector  is 
Eventually  Strong  if  it  satisfies  strong  completeness  and  eventual  weak  accuracy.  We 
denote such a failure detector by $\Diamond\mathcal{S}$.
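
The eight combinations can be enumerated mechanically, as in the Python sketch below. Only two of the labels are confirmed by the text of this section: the Eventually Strong detector (strong completeness with eventual weak accuracy) and the detector with weak completeness and eventual weak accuracy discussed in the introduction. The remaining six labels follow the naming commonly used for this hierarchy and should be read as an assumption here, since the body of Figure 1 is not reproduced in the text. The ASCII prefix `<>` stands for the diamond symbol.

```python
from itertools import product

COMPLETENESS = ("strong", "weak")
ACCURACY = ("strong", "weak", "eventual strong", "eventual weak")

# (completeness, accuracy) -> label. Labels other than "<>S" and "<>W" are
# assumed from common usage, not taken from this section.
DETECTORS = {
    ("strong", "strong"): "P",
    ("strong", "weak"): "S",
    ("strong", "eventual strong"): "<>P",
    ("strong", "eventual weak"): "<>S",
    ("weak", "strong"): "Q",
    ("weak", "weak"): "W",
    ("weak", "eventual strong"): "<>Q",
    ("weak", "eventual weak"): "<>W",
}

# Every pairing of a completeness property with an accuracy property is covered.
assert set(DETECTORS) == set(product(COMPLETENESS, ACCURACY))

for (completeness, accuracy), label in DETECTORS.items():
    print(f"{label:4s} {completeness} completeness, {accuracy} accuracy")
```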

2.5  Algorithms  and  runs 

In  this  paper,  we  focus  on  algorithms  that  use  unreliable  failure  detectors.  To  describe 
such  algorithms,  we  only  need  informal  definitions  of  algorithms  and  runs,  based  on  the 
formal definitions given in [CHT92].¹⁰

An  algorithm  A  is  a  collection  of  n  deterministic  automata,  one  for  each  process  in 
the  system.  Computation  proceeds  in  steps  of  A.  In  each  step,  a  process  (1)  may  receive 
a  message  that  was  sent  to  it,  (2)  queries  its  failure  detector  module,  (3)  undergoes  a 
state  transition,  and  (4)  may  send  a  message.  Since  we  model  asynchronous  systems, 
messages  may  experience  arbitrary  (but  finite)  delays.  Furthermore,  there  is  no  bound 
on  relative  process  speeds. 
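
The four parts of a step can be pictured with the following Python sketch. It is a loose illustration, not the formal model of [CHT92]: the function `take_step`, the `inbox` queue, the `query_fd` callback and the toy automaton `count_transition` are all hypothetical names, and the sketch always receives a pending message when one exists, whereas the model only says that a process may receive one.

```python
from collections import deque


def take_step(state, inbox, query_fd, transition):
    """Perform one step of a process and return (new_state, outgoing_message)."""
    msg = inbox.popleft() if inbox else None   # (1) may receive a message
    suspects = query_fd()                      # (2) query the failure detector module
    # (3) deterministic state transition, which also decides
    # (4) the message to send, if any (None means nothing is sent).
    return transition(state, msg, suspects)


def count_transition(state, msg, suspects):
    """Toy deterministic automaton: count received messages and acknowledge them."""
    if msg is None:
        return state, None
    return state + 1, ("ack", msg)


inbox = deque(["hello"])
state, out = take_step(0, inbox, lambda: set(), count_transition)
print(state, out)
state, out = take_step(state, inbox, lambda: set(), count_transition)
print(state, out)
```

Keeping the failure detector behind a callback mirrors the point made above: the algorithm sees only the module's current output, never how that output is produced.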

Informally, a run of algorithm A using a failure detector $\mathcal{D}$ is a tuple $R = (F, H_{\mathcal{D}}, I, S, T)$
where F is a failure pattern, $H_{\mathcal{D}} \in \mathcal{D}(F)$ is a history of failure detector $\mathcal{D}$ for failure
pattern F, I is an initial configuration of A, S is an infinite sequence of steps of A, and
T is a list of increasing time values indicating when each step in S occurred. A run
must satisfy certain well-formedness and fairness properties. In particular, (1) a process
cannot take a step after it crashes, (2) when a process takes a step and queries its failure
detector module, it gets the current value output by its local failure detector module,
and (3) every process that is correct in F takes an infinite number of steps in S and
eventually receives every message sent to it.

¹⁰ Formal definitions are necessary in [CHT92] to prove a subtle lower bound.

Figure 2: Transforming $\mathcal{D}$ into $\mathcal{D}'$

We use the following notation. Let v be a variable in algorithm A. We denote by v_p
process p’s copy of v. The history of v in run R is denoted by v^R, i.e., v^R(p, t) is the
value of v_p at time t in run R. We denote by D_p process p’s local failure detector module.
Thus, the value of D_p at time t in run R = (F, H_D, I, S, T) is H_D(p, t).

2.6  Reducibility 

We now define what it means for an algorithm T_{D→D'} to transform a failure detector
D into another failure detector D'. Algorithm T_{D→D'} uses D to maintain a variable
output_p at every process p. This variable, reflected in the local state of p, emulates the
output of D' at p. Algorithm T_{D→D'} transforms D into D' if and only if for every run
R = (F, H_D, I, S, T) of T_{D→D'} using D, output^R ∈ D'(F).

Given the reduction algorithm T_{D→D'}, anything that can be done using failure detector
D' can be done using D instead. To see this, suppose a given algorithm B requires
failure detector D', but only D is available. We can still execute B as follows. Concurrently
with B, processes run T_{D→D'} to transform D into D'. We modify Algorithm B at
process p as follows: whenever p is required to query its failure detector module, p reads
the current value of output_p (which is concurrently maintained by T_{D→D'}) instead. This
is illustrated in Figure 2.

Intuitively, since T_{D→D'} is able to use D to emulate D', D provides at least as much
information about process failures as D' does. Thus, if there is an algorithm T_{D→D'} that
transforms D into D', we write D ⪰ D' and say that D' is reducible to D; we also say
that D' is weaker than D. If D ⪰ D' and D' ⪰ D, we write D ≅ D' and say that D and
D' are equivalent.

Note that, in general, T_{D→D'} need not emulate all the failure detector histories of D';
what we do require is that all the failure detector histories it emulates be histories of D'.

Consider the “identity” transformation T_{D→D} in which each process p periodically
writes the current value output by its local failure detector module into output_p. The
following is immediate from T_{D→D} and the definition of reducibility.

Observation 1: P ⪰ Q, S ⪰ W, ◇P ⪰ ◇Q, ◇S ⪰ ◇W.

3  From  weak  completeness  to  strong  completeness 

In Figure 3, we give a reduction algorithm T_{D→D'} that transforms any given failure
detector D that satisfies weak completeness, into a failure detector D' that satisfies strong
completeness. Furthermore, for each failure detector D defined in Figure 1, T_{D→D'} gives a
failure detector D' that has the same accuracy property as D. Roughly speaking, T_{D→D'}
strengthens the completeness property while preserving accuracy.

This result allows us to focus on the failure detectors that are defined in the first row
of Figure 1, i.e., those with strong completeness. This is because T_{D→D'} (together with
Observation 1) shows that every failure detector in the second row of Figure 1 is actually
equivalent to the corresponding failure detector above it in that figure.

Informally, T_{D→D'} works as follows. Every process p periodically sends
(p, suspects_p), where suspects_p denotes the set of processes that p suspects according
to its local failure detector module, to all the processes. When a process q receives
a message of the form (p, suspects_p), it adds suspects_p to output_q and removes p from
output_q.
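The two rules just described can be sketched as a small Python class holding one process's state. This is only an illustration under simplifying assumptions: message transport and the local failure detector are left to the caller, and the names Transformer, task1_message and task2_receive are hypothetical.

```python
class Transformer:
    """One process's state in the weak-to-strong completeness transformation."""

    def __init__(self, pid):
        self.pid = pid
        self.output = set()  # emulated suspect list output_p, initially empty

    def task1_message(self, suspects):
        """Task 1: package the local detector's current suspects for sending to all."""
        return (self.pid, frozenset(suspects))

    def task2_receive(self, message):
        """Task 2: add the sender's suspects, then stop suspecting the sender."""
        sender, suspects = message
        self.output = (self.output | suspects) - {sender}
```

As a usage example, suppose process 3 has crashed and only process 1's local module suspects it. Once the messages built by processes 1 and 2 have been handed to task2_receive at both of them, process 3 is in the output of both, which is the strengthening from weak to strong completeness.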

Let R = (F, H_D, I, S, T) be an arbitrary run of T_{D→D'} using failure detector D. In
the following, the run R and its failure pattern F are fixed. Thus, when we say that a
process crashes we mean that it crashes in F. Similarly, when we say that a process is
correct, we mean that it is correct in F. We will show that output^R satisfies the following
properties:

P1: (Transforming weak completeness into strong completeness) Let p be any process
that crashes. If eventually some correct process permanently suspects p in H_D, then
eventually all correct processes permanently suspect p in output^R. More formally:

    ∀p ∈ crashed(F):
        ∃t ∈ T, ∃q ∈ correct(F), ∀t' ≥ t: p ∈ H_D(q, t')
        ⇒ ∃t ∈ T, ∀q ∈ correct(F), ∀t' ≥ t: p ∈ output^R(q, t')

P2: (Preserving perpetual accuracy) Let p be any process. If no process suspects p
in H_D before time t, then no process suspects p in output^R before time t. More
formally:

    ∀p ∈ Π, ∀t ∈ T:
        ∀t' < t, ∀q ∈ Π - F(t'): p ∉ H_D(q, t')
        ⇒ ∀t' < t, ∀q ∈ Π - F(t'): p ∉ output^R(q, t')

P3: (Preserving eventual accuracy) Let p be any correct process. If there is a time
after which no correct process suspects p in H_D, then there is a time after which
no correct process suspects p in output^R. More formally:

    ∀p ∈ correct(F):
        ∃t ∈ T, ∀q ∈ correct(F), ∀t' ≥ t: p ∉ H_D(q, t')
        ⇒ ∃t ∈ T, ∀q ∈ correct(F), ∀t' ≥ t: p ∉ output^R(q, t')

Every process p executes the following:

    output_p ← ∅

    cobegin
        || Task 1: repeat forever
               {p queries its local failure detector module D_p}
               suspects_p ← D_p
               send (p, suspects_p) to all

        || Task 2: when receive (q, suspects_q) for some q
               output_p ← (output_p ∪ suspects_q) - {q}
    coend

Figure 3: T_{D→D'}: From Weak Completeness to Strong Completeness

Lemma 2: T_{D→D'} satisfies P1.

Proof: Let p be any process that crashes. Suppose that there is a time t after which
some correct process q permanently suspects p in H_D. We must show that there is a
time after which every correct process suspects p in output^R.

Since p crashes, there is a time t' after which no process receives a message from
p. Consider the execution of Task 1 by process q after time t_p = max(t, t'). Process q
sends a message of the type (q, suspects_q) with p ∈ suspects_q to all processes. Eventually,
every correct process receives (q, suspects_q) and adds p to output (see Task 2). Since no
correct process receives any messages from p after time t' and t_p ≥ t', no correct process
removes p from output after time t_p. Thus, there is a time after which every correct
process permanently suspects p in output^R. □

Lemma 3: T_{D→D'} satisfies P2.

Proof: Let p be any process. Suppose that there is a time t before which no process
suspects p in H_D. No process sends a message of the type (-, suspects) with p ∈ suspects
before time t. Thus, no process q adds p to output_q before time t. □

Lemma 4: T_{D→D'} satisfies P3.

Proof: Let p be any correct process. Suppose that there is a time t after which no
correct process suspects p in H_D. Thus, all processes that suspect p after time t eventually
crash. Thus, there is a time t' after which no correct process receives a message of the
type (-, suspects) with p ∈ suspects.

Let q be any correct process. We must show that there is a time after which q does
not suspect p in output^R.

Consider the execution of Task 1 by process p after time t'. Process p sends a message
m = (p, suspects_p) to q. When q receives m, it removes p from output_q (see Task 2). Since
q does not receive any messages of the type (-, suspects) with p ∈ suspects after time t',
q does not add p to output_q after time t'. Thus, there is a time after which q does not
suspect p in output^R. □

Theorem 5: T_{D→D'} transforms Q into P, W into S, ◇Q into ◇P, and ◇W into ◇S.

Proof: By Lemma 2, T_{D→D'} transforms Q, W, ◇Q, and ◇W into failure detectors
that satisfy strong completeness. By Lemma 3, T_{D→D'} preserves the strong accuracy of
Q and the weak accuracy of W. By Lemma 4, T_{D→D'} preserves the eventual strong
accuracy of ◇Q and the eventual weak accuracy of ◇W. The theorem immediately follows.
□

By Theorem 5 and Observation 1, we have:

Corollary 6: P ≅ Q, S ≅ W, ◇P ≅ ◇Q, and ◇S ≅ ◇W.

4  Reliable  Broadcast 

We now define Reliable Broadcast, a communication primitive that we often use in
our algorithms. Informally, Reliable Broadcast guarantees that (1) all correct processes
deliver the same set of messages, (2) all messages broadcast by correct processes are
delivered, and (3) no spurious messages are ever delivered. Formally, Reliable Broadcast
is defined in terms of two primitives, R-broadcast(m) and R-deliver(m), where m is a
message drawn from a set of possible messages. When a process executes R-broadcast(m),
we say that it R-broadcasts m, and when a process executes R-deliver(m), we say that
it R-delivers m.

Every process p executes the following:

    To execute R-broadcast(m):
        send m to all (including p)

    R-deliver(m) occurs as follows:
        when receive m for the first time
            if sender(m) ≠ p then send m to all
            R-deliver(m)

Figure 4: Reliable Broadcast by message diffusion

Reliable Broadcast satisfies the following three properties:^11

Validity:  If  a  correct  process  R-broadcasts  a  message  m,  then  all  correct  processes 
eventually  R-deliver  m. 

Agreement:  If  a  correct  process  R-delivers  a  message  m,  then  all  correct  processes 
eventually  R-deliver  m. 

Uniform  integrity:  For  any  message  m,  each  process  R-delivers  m  at  most  once,  and 
only  if  m  was  R-broadcast  by  some  process. 

In  Figure  4,  we  give  a  simple  Reliable  Broadcast  algorithm  for  asynchronous  systems. 
Informally,  when  a  process  receives  a  message  for  the  first  time,  it  relays  the  message 
to  all  processes  and  then  R-delivers  it.  It  is  easy  to  show  that  this  algorithm  satisfies 
validity,  agreement  and  uniform  integrity  in  asynchronous  systems  with  up  to  n  - 1  crash 
failures.  The  proof  is  obvious  and  therefore  omitted. 
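The diffusion rule can be exercised with a small single-message simulation. This is a sketch under simplifying assumptions that are not part of the paper: processes are numbered 0..n-1, crashed processes never handle a received message, and the hypothetical parameter reached lists the processes that the sender's own sends get to, so a strict subset models a sender that crashes partway through its broadcast.

```python
from collections import deque


def diffuse(n, sender, message, crashed=frozenset(), reached=None):
    """Simulate message diffusion for one message among processes 0..n-1.

    Returns {pid: number of times pid R-delivered the message} for every
    process not in `crashed`.
    """
    delivered = {p: 0 for p in range(n) if p not in crashed}
    seen = set()  # processes that have already received the message once
    targets = range(n) if reached is None else reached
    network = deque((q, message) for q in targets)  # the sender's "send m to all"
    while network:
        dest, m = network.popleft()
        if dest in crashed or dest in seen:
            continue  # crashed processes take no steps; repeats are ignored
        seen.add(dest)  # "receive m for the first time"
        if dest != sender:
            network.extend((q, m) for q in range(n))  # relay before delivering
        delivered[dest] += 1  # R-deliver(m)
    return delivered
```

For example, diffuse(4, 0, "m", crashed={0}, reached={1}) models a sender that reaches only process 1 before crashing. Process 1 relays the message, so processes 2 and 3 still deliver it exactly once, which is the behaviour the agreement property asks for.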

5  The  Consensus  problem 

In the Consensus problem, all correct processes propose a value and must reach a
unanimous and irrevocable decision on some value that is related to the proposed values
[Fis83]. We define the Consensus problem in terms of two primitives, propose(v) and
decide(v), where v is a value drawn from a set of possible proposed values. When a
process executes propose(v), we say that it proposes v; similarly, when a process executes
decide(v), we say that it decides v. The Consensus problem is specified as follows:

Termination: Every correct process eventually decides some value.

Uniform validity: If a process decides v, then v was proposed by some process.^12

Uniform integrity: Every process decides at most once.

Agreement: No two correct processes decide differently.

11 For simplicity, we assume that each message is unique. In practice, this can be achieved by tagging
the identity of the sender and a sequence number on each message.
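To make the four Consensus properties concrete, the following sketch checks a finished run against them. It is an illustration only: the function name and the way a run is encoded (dictionaries of proposals and decisions) are assumptions, and on a finite trace termination can only be checked as "has decided by the end of the trace".

```python
def check_consensus(proposals, decisions, correct):
    """Check one finished run against the four Consensus properties.

    proposals: {pid: proposed value}
    decisions: {pid: list of values decided by pid, in order}
    correct:   set of pids that never crash
    Returns the set of violated property names (empty if none is violated).
    """
    violated = set()
    # Termination: every correct process has decided.
    if any(len(decisions.get(p, [])) == 0 for p in correct):
        violated.add("termination")
    # Uniform validity: every decided value was proposed by some process.
    proposed = set(proposals.values())
    if any(v not in proposed for vs in decisions.values() for v in vs):
        violated.add("uniform validity")
    # Uniform integrity: no process, correct or not, decides twice.
    if any(len(vs) > 1 for vs in decisions.values()):
        violated.add("uniform integrity")
    # Agreement: only correct processes are required to agree.
    correct_values = {v for p in correct for v in decisions.get(p, [])}
    if len(correct_values) > 1:
        violated.add("agreement")
    return violated
```

Note that a faulty process deciding a different (but proposed) value does not violate agreement as stated here, since that property constrains correct processes only.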

It  has  been  proved  that  there  is  no  deterministic  algorithm  for  Consensus  in  asynchronous 
systems  that  are  subject  to  even  a  single  crash  failure  [FLP85,  DDS87].  We  now  show 
how  to  use  unreliable  failure  detectors  to  solve  Consensus  in  asynchronous  systems. 


6 Solving Consensus using unreliable failure detectors

We now show how to solve Consensus using each one of the eight failure detectors defined
in Figure 1. By Theorem 5, we only need to show how to solve Consensus using the four
failure detectors that satisfy strong completeness, namely, P, S, ◇P, and ◇S.

Solving Consensus with the Perfect failure detector P is simple, and is left as an
exercise for the reader. In Section 6.1, we give a Consensus algorithm that uses S.
In Section 6.2, we give a Consensus algorithm that uses ◇S. Since ◇P ⪰ ◇S, this
algorithm also solves Consensus with ◇P.

The Consensus algorithm that uses S can tolerate any number of failures. In contrast,
the one that uses ◇S requires a majority of correct processes. We show that this
requirement is necessary even if one uses ◇P, a failure detector that is stronger than ◇S.
Thus, our algorithm for solving Consensus using ◇S (or ◇P) is optimal with respect to
the number of failures that it tolerates.


6.1  Using  a  Strong  failure  detector  S 

Given any Strong failure detector S, the algorithm in Figure 5 solves Consensus in
asynchronous systems. This algorithm runs through 3 phases. In Phase 1, processes
execute n - 1 asynchronous rounds (r_p denotes the current round number of process p)
during which they broadcast and relay their proposed values. Each process p waits until
it receives a round r message from every process that is not in S_p before proceeding to
round r + 1. Note that it is possible that while p is waiting for a message from q in round
r, q is added to S_p. By the above rule, p stops waiting for q’s message and proceeds to
round r + 1.

By the end of Phase 2, correct processes agree on a vector based on the proposed
values of all processes. The pth element of this vector either contains the proposed value
of process p, or ⊥. We will show that this vector contains the proposed value of at least
one process. In Phase 3, correct processes decide the first non-trivial component of this
vector.

12 The validity condition captures the relation between the decision value and the proposed values.
Changing this condition results in other types of Consensus [Fis83].

Let f denote the maximum number of processes that may crash.^13 Phase 1 of the
algorithm consists of n - 1 rounds, rather than the usual f + 1 rounds of traditional
Consensus algorithms (for synchronous systems). Intuitively, this is because even a
correct process p may be suspected to have crashed by other processes. In this case,
p’s messages may be ignored, and p appears to commit “send-omission” failures. Thus,
up to n - 1 processes may appear to commit such failures (rather than f). Note that
because S satisfies weak accuracy (namely, some correct process is never suspected), the
maximum number of processes that may fail or appear to fail is n - 1 rather than n.

V_p[q] denotes p’s current estimate of q’s proposed value. Δ_p[q] = v_q at the end of
round r if and only if p receives v_q, the value proposed by q, for the first time in round
r.
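The vector bookkeeping in the three phases can be sketched as three small functions. This is an illustration under assumptions, not the paper's own code: vectors are Python lists indexed from 0, None stands in for ⊥ (so proposed values must not be None), and the waiting and message passing steps are left to the caller. The function names are hypothetical.

```python
BOTTOM = None  # stands in for ⊥


def phase1_round_update(V, received_deltas):
    """Apply one Phase 1 round: learn new values and return the next Δ_p.

    V is p's vector (mutated in place); received_deltas are the Δ_q vectors
    carried by this round's messages.
    """
    n = len(V)
    delta = [BOTTOM] * n
    for k in range(n):
        if V[k] is BOTTOM:
            for dq in received_deltas:
                if dq[k] is not BOTTOM:
                    V[k] = dq[k]
                    delta[k] = dq[k]  # relay a newly learned value in the next round
                    break
    return delta


def phase2_merge(V, received_vectors):
    """Phase 2: blank out every component that some received vector lacks."""
    return [BOTTOM if any(vq[k] is BOTTOM for vq in received_vectors) else V[k]
            for k in range(len(V))]


def phase3_decide(V):
    """Phase 3: decide the first non-⊥ component."""
    return next(v for v in V if v is not BOTTOM)
```

Because Δ_p is reset at the start of every round update, a value is relayed only in the round after it was first learned, matching the definition of Δ_p[q] above.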

Let R = (F, H_S, I, S, T) be any run of the algorithm in Figure 5 using S in which all
correct processes propose a value. We have to show that termination, uniform validity,
agreement and uniform integrity hold.

Lemma 8: For all p and q, and in all phases, V_p[q] is either v_q or ⊥.

Proof: Obvious from the algorithm in Figure 5. □

Lemma  9:  Every  correct  process  eventually  reaches  Phase  3. 

Proof: [sketch] The only way a correct process p can be prevented from reaching Phase
3 is by blocking forever at one of the two wait statements (in Phase 1 and 2, respectively).
This can happen only if p is waiting forever for a message from a process q and q never
joins S_p. There are two cases to consider:

1. q crashes. Since S satisfies strong completeness, there is a time after which q ∈ S_p.

2. q does not crash. In this case, we can show (by an easy but tedious induction on
the round number) that q eventually sends the message p is waiting for.

In both cases p is not blocked forever and reaches Phase 3. □

Since S satisfies weak accuracy there is a correct process c that is never suspected by any
process, i.e., ∀t ∈ T, ∀p ∈ Π - F(t): c ∉ H_S(p, t). Let Π_1 denote the set of processes that
complete all n - 1 rounds of Phase 1, and Π_2 denote the set of processes that complete
Phase 2. We say V_p ≤ V_q if and only if for all k ∈ Π, V_p[k] is either V_q[k] or ⊥.
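The ordering on vectors just defined is easy to state as a predicate. A minimal sketch, assuming vectors are equal-length Python lists and None stands in for ⊥; the function name is illustrative.

```python
BOTTOM = None  # stands in for ⊥


def vector_leq(Vp, Vq):
    """True iff every component of Vp is either the matching component of Vq or ⊥."""
    return all(a is BOTTOM or a == b for a, b in zip(Vp, Vq))
```

Read this way, V_p ≤ V_q says that V_q knows every proposed value that V_p knows.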

Lemma 10: In every round r, 1 ≤ r ≤ n - 1, all processes p ∈ Π_1 receive (r, Δ_c, c) from
process c, i.e., (r, Δ_c, c) is in msgs_p[r].

Proof: Since p ∈ Π_1, p completes all n - 1 rounds of Phase 1. At each round r, since
c ∉ S_p, p waits for and receives the message (r, Δ_c, c) from c. □

13 In the literature, t is often used instead of f, the notation adopted here. In this paper, we reserve t
to denote real-time.

Every process p executes the following:

procedure propose(v_p)
    V_p ← (⊥, ⊥, ..., ⊥)          {p’s estimate of the proposed values}
    V_p[p] ← v_p
    Δ_p ← V_p

    Phase 1: {asynchronous rounds r_p, 1 ≤ r_p ≤ n - 1}
        for r_p ← 1 to n - 1
            send (r_p, Δ_p, p) to all
            wait until [∀q: received (r_p, Δ_q, q) or q ∈ S_p]    {Query the failure detector}
            msgs_p[r_p] ← {(r_p, Δ_q, q) | received (r_p, Δ_q, q)}
            Δ_p ← (⊥, ⊥, ..., ⊥)
            for k ← 1 to n
                if V_p[k] = ⊥ and ∃(r_p, Δ_q, q) ∈ msgs_p[r_p] with Δ_q[k] ≠ ⊥ then
                    V_p[k] ← Δ_q[k]
                    Δ_p[k] ← Δ_q[k]

    Phase 2: send V_p to all
        wait until [∀q: received V_q or q ∈ S_p]    {Query the failure detector}
        lastmsgs_p ← {V_q | received V_q}
        for k ← 1 to n
            if ∃V_q ∈ lastmsgs_p with V_q[k] = ⊥ then V_p[k] ← ⊥

    Phase 3: decide(first non-⊥ component of V_p)

Figure 5: Solving Consensus using S.

Lemma 11: For all p ∈ Π_1, V_c ≤ V_p at the end of Phase 1.

Proof: Suppose for some process q, V_c[q] ≠ ⊥ at the end of Phase 1. From Lemma 8,
V_c[q] = v_q. Consider any p ∈ Π_1. We must show that V_p[q] = v_q at the end of Phase 1.
This is obvious if p = c, thus we consider the case where p ≠ c.

Let r be the first round in which c received v_q (if c = q, we define r to be 0). From
the algorithm, it is clear that Δ_c[q] = v_q at the end of round r. There are two cases to
consider:

1. r ≤ n - 2. In round r + 1 ≤ n - 1, c relays v_q by sending the message (r + 1, Δ_c, c)
with Δ_c[q] = v_q to all. From Lemma 10, p receives (r + 1, Δ_c, c) in round r + 1.
From the algorithm, it is clear that p sets V_p[q] to v_q by the end of round r + 1.

2. r = n - 1. In this case, c received v_q for the first time in round n - 1. Since each
process relays v_q (in its vector Δ) at most once, it is easy to see that v_q was relayed
by all n - 1 processes in Π - {c}, including p, before being received by c. Since p
sets V_p[q] = v_q before relaying v_q, it follows that V_p[q] = v_q at the end of Phase 1. □


Lemma 12: For all p ∈ Π_2, V_c = V_p at the end of Phase 2.

Proof: Consider any p ∈ Π_2 and q ∈ Π. We have to show that V_p[q] = V_c[q] at the end
of Phase 2. There are two cases to consider:

1. V_c[q] = v_q at the end of Phase 1. From Lemma 11, for all processes p' ∈ Π_1
(including p and c), V_{p'}[q] = v_q at the end of Phase 1. Thus, for all the vectors
V sent in Phase 2, V[q] = v_q. Hence, both V_p[q] and V_c[q] remain equal to v_q
throughout Phase 2.

2. V_c[q] = ⊥ at the end of Phase 1. Since c ∉ S_p, p waits for and receives V_c in Phase
2. Since V_c[q] = ⊥, p sets V_p[q] ← ⊥ at the end of Phase 2. □

Lemma 13: For all p ∈ Π_2, V_p[c] = v_c at the end of Phase 2.

Proof: It is clear from the algorithm that V_c[c] = v_c at the end of Phase 1. From
Lemma 11, for all q ∈ Π_1, V_q[c] = v_c at the end of Phase 1. Thus, no process sends V
with V[c] = ⊥ in Phase 2. From the algorithm, it is clear that for all p ∈ Π_2, V_p[c] = v_c
at the end of Phase 2. □

Theorem 14: Given any Strong failure detector S, the algorithm in Figure 5 solves
Consensus in asynchronous systems with f < n.

Proof: From the algorithm in Figure 5, it is clear that no process decides more than
once, and this satisfies the uniform integrity requirement of Consensus. From Lemma 9,
every correct process eventually reaches Phase 3. From Lemma 13, the vector V_p of every
correct process p has at least one non-⊥ component in Phase 3 (namely, V_p[c] = v_c). From
the algorithm, every process p that reaches Phase 3 decides on the first non-⊥ component of
V_p. Thus, every correct process decides some non-⊥ value in Phase 3, and this satisfies
termination of Consensus. From Lemma 12, all processes that reach Phase 3 have the
same vector V. Thus, all correct processes decide the same value, and agreement of
Consensus is satisfied. From Lemma 8, this non-⊥ decision value is the proposed value
of some process. Thus, uniform validity of Consensus is also satisfied. □

By Theorems 5 and 14, we have:

Corollary 15: Given any Weak failure detector W, Consensus is solvable in
asynchronous systems with f < n.

6.2 Using an Eventually Strong failure detector ◇S

Our previous solution to Consensus used S, a failure detector that satisfies weak accuracy:
at least one correct process was never suspected. We now solve Consensus using ◇S, a
failure detector that only satisfies eventual weak accuracy. With ◇S, all processes may
be erroneously added to the lists of suspects at one time or another. However, there is
a correct process and a time after which that process is not suspected to have crashed.
Note that at any given time t, processes cannot use ◇S to determine whether any specific
process is correct, or whether some correct process will never be suspected after time t.

Given any Eventually Strong failure detector ◇S, the algorithm in Figure 6 solves
Consensus in asynchronous systems with a majority of correct processes. We show that
solving Consensus using ◇S requires this majority.^14 Thus, our algorithm is optimal
with respect to the number of failures that it tolerates.

The algorithm in Figure 6 uses the rotating coordinator paradigm [Rei82, CM84,
DLS88, BGP89, CT90]. Computation proceeds in asynchronous “rounds”. We assume
that all processes have a priori knowledge that during round r, the coordinator is process
c = (r mod n) + 1. All messages are either to or from the “current” coordinator. Every
time a process becomes a coordinator, it tries to determine a consistent decision value.
If the current coordinator is correct and is not suspected by any surviving process, then
it will succeed, and it will R-broadcast this decision value.
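The coordinator rule c = (r mod n) + 1 can be written directly. A minimal sketch, assuming processes are numbered 1..n as the formula implies; the function name is illustrative.

```python
def coordinator(r, n):
    """Coordinator of round r among processes numbered 1..n."""
    return (r % n) + 1
```

Over any n consecutive rounds every process is coordinator exactly once. For n = 3 the coordinators of rounds 1 through 6 are 2, 3, 1, 2, 3, 1.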

The  algorithm  in  Figure  6  goes  through  three  asynchronous  epochs,  each  of  which 
may  span  several  asynchronous  rounds.  In  the  first  epoch,  several  decision  values  are 
possible.  In  the  second  epoch,  a  value  gets  locked:  no  other  decision  value  is  possible.  In 
the  third  epoch,  processes  decide  the  locked  value. 

14 In fact, we show that a majority of correct processes is required even if one uses ◇P, a stronger
failure detector.

Every process p executes the following:

procedure propose(v_p)
    estimate_p ← v_p         {denotes p’s estimate of the decision value}
    state_p ← undecided
    r_p ← 0                  {r_p denotes the current round number}
    ts_p ← 0                 {the round in which estimate_p was last updated, initially 0}

    {Rotate through coordinators until decision is reached}
    while state_p = undecided
        r_p ← r_p + 1
        c_p ← (r_p mod n) + 1        {c_p is the current coordinator}

        Phase 1: {All processes p send estimate_p to the current coordinator}
            send (p, r_p, estimate_p, ts_p) to c_p

        Phase 2: {The current coordinator gathers n - f estimates and proposes a new estimate}
            if p = c_p then
                wait until [for n - f processes q: received (q, r_p, estimate_q, ts_q) from q]
                msgs_p[r_p] ← {(q, r_p, estimate_q, ts_q) | p received (q, r_p, estimate_q, ts_q) from q}
                t ← largest ts_q such that (q, r_p, estimate_q, ts_q) ∈ msgs_p[r_p]
                estimate_p ← select one estimate_q such that (q, r_p, estimate_q, t) ∈ msgs_p[r_p]
                send (p, r_p, estimate_p) to all

        Phase 3: {All processes wait for the new estimate proposed by the current coordinator}
            wait until [received (c_p, r_p, estimate_{c_p}) from c_p or c_p ∈ ◇S_p]    {Query the failure detector}
            if [received (c_p, r_p, estimate_{c_p}) from c_p] then    {p received estimate_{c_p} from c_p}
                estimate_p ← estimate_{c_p}
                ts_p ← r_p
                send (p, r_p, ack) to c_p
            else send (p, r_p, nack) to c_p        {p suspects that c_p crashed}

        Phase 4: {The current coordinator waits for n - f replies. If these replies indicate that
                  n - f processes adopted its estimate, the coordinator sends a request to decide.}
            if p = c_p then
                wait until [for n - f processes q: received (q, r_p, ack) or (q, r_p, nack)]
                if [for n - f processes q: received (q, r_p, ack)] then
                    R-broadcast(p, r_p, estimate_p, decide)

{When p receives a decide message, it decides}
when R-deliver(q, r_q, estimate_q, decide)
    if state_p = undecided then
        decide(estimate_q)
        state_p ← decided

Figure 6: Solving Consensus using ◇S

Each round of this Consensus algorithm is divided into four asynchronous phases.
In Phase 1, every process sends its current estimate of the decision value, timestamped
with the round number in which it adopted this estimate, to the current coordinator,
c. In Phase 2, c gathers n - f such estimates, selects one with the largest timestamp,
and sends it to all the processes as their new estimate, estimate_c. In Phase 3, for each
process p there are two possibilities:

1. p receives estimate_c from c and sends an ack to c to indicate that it adopted
estimate_c as its own estimate; or

2. upon consulting its failure detector module ◇S_p, p suspects that c crashed, and
sends a nack to c.

In Phase 4, c waits for n - f replies (acks or nacks). If all n - f replies are acks, then
c knows that n - f processes changed their estimates to estimate_c, and thus estimate_c
is locked. Consequently, c R-broadcasts a request to decide estimate_c. At any time, if a
process R-delivers such a request, it decides accordingly.
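The coordinator's two local computations, choosing an estimate in Phase 2 and testing the replies in Phase 4, can be sketched as follows. The tuple layouts mirror the messages described above, but the function names and the use of the strings "ack" and "nack" are assumptions made for illustration. Waiting for n - f messages is left to the caller.

```python
def select_estimate(msgs):
    """Phase 2: pick an estimate carrying the largest timestamp.

    msgs is a list of (q, r, estimate_q, ts_q) tuples already gathered from
    n - f processes for the current round.
    """
    t = max(ts for (_, _, _, ts) in msgs)
    return next(est for (_, _, est, ts) in msgs if ts == t)


def should_decide(replies, n, f):
    """Phase 4: True iff the n - f gathered replies are all acks.

    replies is a list of (q, r, kind) tuples with kind "ack" or "nack".
    """
    if len(replies) != n - f:
        raise ValueError("expected exactly n - f replies")
    return all(kind == "ack" for (_, _, kind) in replies)
```

If several estimates share the largest timestamp, this sketch takes the first one in the list. The description above only requires selecting one of them.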

The proof that the algorithm in Figure 6 solves Consensus is as follows. Let R be
any run of the algorithm in Figure 6 using ◇S in which all correct processes propose
a value. We have to show that termination, uniform validity, agreement and uniform
integrity hold.

Lemma 17: No two processes decide differently.^15

Proof: If no process ever decides, the lemma is trivially true. If any process decides, it
must be the case that a coordinator R-broadcast a message of the type (-, -, -, decide).
This coordinator must have received n - f messages of the type (-, -, ack) in Phase
4. Let r be the smallest round number in which n - f messages of the type (-, r, ack)
are sent to a coordinator in Phase 3. Let c denote the coordinator of round r, i.e.,
c = (r mod n) + 1. Let estimate_c denote c’s estimate at the end of Phase 2 of round r.
We claim that for all rounds r' ≥ r, if a coordinator c' sends estimate_{c'} in Phase 2 of
round r', then estimate_{c'} = estimate_c.

The proof is by induction on the round number. The claim trivially holds for r' = r.
Now assume that the claim holds for all r', r ≤ r' < k. Let c_k be the coordinator of
round k, i.e., c_k = (k mod n) + 1. We will show that the claim holds for r' = k, i.e., if
c_k sends estimate_{c_k} in Phase 2 of round k, then estimate_{c_k} = estimate_c.

From the algorithm it is clear that if c_k sends estimate_{c_k} in Phase 2 of round k then
it must have received estimates from at least n - f processes. Since f < n/2, any two sets
of n - f processes have at least n - 2f > 0 members in common, so there is some
process p such that p sent a (p, r, ack) message to c in Phase 3 of round r and such that
(p, k, estimate_p, ts_p) is in msgs_{c_k}[k] in Phase 2 of round k. Since p sent (p, r, ack) to c in
Phase 3 of round r, ts_p = r at the end of Phase 3 of round r. Since ts_p is non-decreasing,
ts_p ≥ r in Phase 1 of round k. Thus in Phase 2 of round k, (p, k, estimate_p, ts_p) is in
msgs_{c_k}[k] with ts_p ≥ r. It is easy to see that there is no message (q, k, estimate_q, ts_q) in
msgs_{c_k}[k] for which ts_q ≥ k. Let t be the largest ts_q such that (q, k, estimate_q, ts_q) is in
msgs_{c_k}[k]. Thus r ≤ t < k.

15 This property, called uniform agreement, is stronger than the agreement requirement of Consensus
which applies only to correct processes.

In Phase 2 of round k, c_k executes estimate_{c_k} ← estimate_q where
(q, k, estimate_q, t) is in msgs_{c_k}[k]. From Figure 6, it is clear that q adopted estimate_q as
its estimate in Phase 3 of round t. Thus, the coordinator of round t sent estimate_q to q in
Phase 2 of round t. Since r ≤ t < k, by the induction hypothesis, estimate_q = estimate_c.
Thus, c_k sets estimate_{c_k} ← estimate_c in Phase 2 of round k. This concludes the proof
of the claim.

We now show that if a process decides a value, then it decides estimate_c. Suppose that
some process p R-delivers (q, r_q, estimate_q, decide), and thus decides estimate_q. Process
q must have R-broadcast (q, r_q, estimate_q, decide) in Phase 4 of round r_q. From Figure
6, q must have received n - f messages of the type (-, r_q, ack) in Phase 4 of round r_q.
By the definition of r, r ≤ r_q. From the above claim, estimate_q = estimate_c. □
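
The induction step above relies on a counting fact: when f < ⌈n/2⌉, any two sets of n - f processes have a common member, so some process that acknowledged in round r also reports its estimate in round k. The following Python sketch is an illustration only and is not part of the original algorithm; the function names are hypothetical. It checks this fact by brute force for small n and also spells out the rotating coordinator rule c = (r mod n) + 1.

```python
from itertools import combinations


def coordinator(r: int, n: int) -> int:
    """Coordinator of round r under the rotating rule c = (r mod n) + 1."""
    return (r % n) + 1


def quorums_always_intersect(n: int, f: int) -> bool:
    """Brute-force check that any two sets of n - f processes share a member."""
    processes = range(1, n + 1)
    quorums = [set(q) for q in combinations(processes, n - f)]
    # A non-empty intersection is truthy; an empty one is falsy.
    return all(a & b for a in quorums for b in quorums)
```

With n = 5 and f = 2 every pair of quorums intersects, whereas with n = 4 and f = 2 the sets {1, 2} and {3, 4} are disjoint, which is why the proof needs a majority of correct processes.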

Lemma  18:  Every  correct  process  eventually  decides  some  value. 

Proof:  There  are  two  possible  cases: 

1. Some correct process decides. It must have R-delivered some message of the type
(-, -, -, decide). By the agreement property of Reliable Broadcast, all correct
processes eventually R-deliver this message and decide.

2.  No  correct  process  decides.  We  claim  that  no  correct  process  remains  blocked 
forever  at  one  of  the  wait  statements.  The  proof  is  by  contradiction.  Let  r  be  the 
smallest  round  number  in  which  some  correct  process  blocks  forever  at  one  of  the 
wait  statements.  Thus,  all  correct  processes  reach  the  end  of  Phase  1  of  round  r: 
they all send a message of the type (-, r, estimate, -) to the current coordinator
c = (r mod n) + 1. Therefore at least n - f such messages are sent to c. There are
two cases to consider:

(a) Eventually, c receives those messages and replies by sending
(c, r, estimate_c). Thus, c does not block forever at the wait statement in
Phase 2.

(b)  c  crashes. 

In the first case, every correct process receives (c, r, estimate_c). In the second case,
since ◇S satisfies strong completeness, for every correct process p there is a time
after which c is permanently suspected by p, i.e., c ∈ ◇S_p. Thus in either case, no
correct process blocks at the second wait statement (Phase 3). So every correct
process sends a message of the type (-, r, ack) or (-, r, nack) to c in Phase 3. Since
there are at least n - f correct processes, c cannot block at the wait statement of Phase
4. This shows that all correct processes complete round r, a contradiction that
completes the proof of our claim.

Since ◇S satisfies eventual weak accuracy, there is a correct process q and a time
t such that no correct process suspects q after t. Thus, all processes that suspect
q after time t eventually crash and there is a time t' after which no process sends
a message of the type (-, r, nack) where q is the coordinator of round r (i.e.,
q = (r mod n) + 1). From this and the above claim, there must be a round r such
that:

(a)  All  correct  processes  reach  round  r  after  time  t'  (when  no  process  suspects  q). 

(b)  q  is  the  coordinator  of  round  r  (i.e.,  q  =  (r  mod  n)  +  1). 

In Phase 1 of round r, all correct processes send their estimates to q. In Phase
2, q receives n - f such estimates, and sends (q, r, estimate_q) to all processes. In
Phase 3, since q is not suspected by any correct process after time t, every correct
process waits for q's estimate, eventually receives it, and replies with an ack to q.
Furthermore, no process sends a nack to q (that can only happen when a process
suspects q). Thus in Phase 4, q receives n - f messages of the type (-, r, ack) (and
no messages of the type (-, r, nack)), and q R-broadcasts (q, r, estimate_q, decide).
By the validity property of Reliable Broadcast, eventually all correct processes
R-deliver q's message and decide, a contradiction. Thus case 2 is impossible, and
this concludes the proof of the lemma. □
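
The existence of a round r coordinated by q that starts after time t' follows from the rotating coordinator rule: every process becomes coordinator once in any n consecutive rounds. The short Python sketch below is illustrative only; the function name and parameters are hypothetical. It computes the first round at or after a given round whose coordinator is q.

```python
def next_round_coordinated_by(q: int, start_round: int, n: int) -> int:
    """Smallest round r >= start_round with (r mod n) + 1 == q, for q in 1..n."""
    target = (q - 1) % n                   # r mod n must equal q - 1
    offset = (target - start_round) % n    # always in the range 0..n-1
    return start_round + offset
```

Because the offset is always less than n, no matter how late the failure detector stabilises, a round coordinated by the unsuspected process q is at most n - 1 rounds away.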

Theorem 19: Given any Eventually Strong failure detector ◇S, the algorithm in Figure
6 solves Consensus in asynchronous systems with f < ⌈n/2⌉.

Proof: 

Termination:  by  Lemma  18. 

Agreement:  by  Lemma  17. 

Uniform  integrity:  It  is  clear  from  the  algorithm  that  no  process  decides  more  than  once. 

Uniform  validity:  from  the  algorithm,  it  is  clear  that  all  the  estimates  that  a  coordinator 
receives  in  Phase  2  are  proposed  values.  Therefore,  the  decision  value  that  a 
coordinator  selects  from  these  estimates  must  be  the  value  proposed  by  some 
process.  Thus,  uniform  validity  is  satisfied.  □ 

By  Theorems  5  and  19,  we  have: 

Corollary 20: Given any Eventually Weak failure detector ◇W, Consensus is solvable
in asynchronous systems with f < ⌈n/2⌉.

Thus, the weakest failure detector considered in this paper, ◇W, is sufficient to solve
Consensus in asynchronous systems. This leads to the following question: What is the
weakest failure detector for solving Consensus? Using the concept of reducibility, in
[CHT92] we show that ◇W is indeed the weakest failure detector for solving Consensus
in asynchronous systems with a majority of correct processes. More precisely, we show:

Theorem 21: [CHT92] If a failure detector D can be used to solve Consensus in an
asynchronous system, then D ⪰ ◇W in that system.

By  Corollary  20  and  Theorem  21,  we  have: 

Corollary 22: ◇W is the weakest failure detector for solving Consensus in an
asynchronous system with f < ⌈n/2⌉.

6.3  A  lower  bound  on  fault-tolerance 

In Section 6.1, we showed that failure detectors with perpetual accuracy (i.e., P, Q, S, or
W) can be used to solve Consensus in asynchronous systems with any number of failures.
In contrast, with failure detectors with eventual accuracy (i.e., ◇P, ◇Q, ◇S, or ◇W),
our Consensus algorithms required a majority of the processes to be correct. We now
show that this requirement is necessary: Any algorithm that uses ◇P (the strongest of
our four failure detectors with eventual accuracy) to solve Consensus requires a majority
of correct processes. Thus, the algorithm in Figure 6 is optimal with respect to
fault-tolerance.

Theorem 23: There is an Eventually Perfect failure detector ◇P such that there is no
algorithm A which solves Consensus using ◇P in asynchronous systems with f ≥ ⌈n/2⌉.

Proof: We now describe the behaviour of an Eventually Perfect failure detector ◇P
such that with every algorithm A, there is a run R_A of A using ◇P that does not satisfy
the specification of Consensus. Partition the processes into two sets Π_0 and Π_1 such that
Π_0 contains ⌈n/2⌉ processes, and Π_1 contains the remaining ⌊n/2⌋ processes. Consider any
Consensus algorithm A, and the following two runs of A using ◇P:

• Run R_0 = (F_0, H_0, I, S_0, T_0): All processes in Π_0 propose 0, and all processes in
Π_1 propose 1. All processes in Π_0 are correct in F_0, while those in Π_1 crash in
F_0 at the beginning of the run, i.e., ∀t ∈ T : F_0(t) = Π_1 (this is possible since
f ≥ ⌈n/2⌉). Every process in Π_0 permanently suspects every process in Π_1, i.e.,
∀t ∈ T, ∀p ∈ Π_0 : H_0(p, t) = Π_1. In this run, it is clear that ◇P satisfies the
specification of an Eventually Perfect failure detector.

• Run R_1 = (F_1, H_1, I, S_1, T_1): As in R_0, all processes in Π_0 propose 0, and all
processes in Π_1 propose 1. All processes in Π_1 are correct in F_1, while those in Π_0
crash in F_1 at the beginning of the run, i.e., ∀t ∈ T : F_1(t) = Π_0. Every process in
Π_1 permanently suspects every process in Π_0, i.e., ∀t ∈ T, ∀p ∈ Π_1 : H_1(p, t) = Π_0.
Clearly, ◇P satisfies the specification of an Eventually Perfect failure detector in
this run.

Assume, without loss of generality, that both R_0 and R_1 satisfy the specifications of
Consensus. Let q_0 ∈ Π_0, q_1 ∈ Π_1, t_0 be the time at which q_0 decides in R_0, and t_1 be the
time at which q_1 decides in R_1. There are three possible cases. In each case we construct
a run R_A = (F_A, H_A, I_A, S_A, T_A) of algorithm A using ◇P such that ◇P satisfies the
specification of an Eventually Perfect failure detector, but R_A violates the specification
of Consensus.

1. In R_0, q_0 decides 1. Let R_A = (F_0, H_0, I_A, S_0, T_0) be a run identical to R_0 except
that all processes in Π_1 propose 0. Since in F_0 the processes in Π_1 crash right
from the beginning of the run, R_0 and R_A are indistinguishable to q_0. Thus, q_0
decides 1 in R_A (as it did in R_0) although every process proposes 0 in R_A, thereby
violating the uniform validity condition of Consensus.

2. In R_1, q_1 decides 0. This case is symmetric to Case 1.

3. In R_0, q_0 decides 0, and in R_1, q_1 decides 1. Construct R_A = (F_A, H_A, I, S_A, T_A) as
follows. No processes crash in F_A, i.e., ∀t ∈ T : F_A(t) = ∅. As before, all processes
in Π_0 propose 0 and all processes in Π_1 propose 1. All messages from processes
in Π_0 to those in Π_1 and vice-versa, are delayed until time max(t_0, t_1). Until time
max(t_0, t_1), every process in Π_0 suspects every process in Π_1, and every process in
Π_1 suspects every process in Π_0. After time max(t_0, t_1), no process suspects any
other process, i.e.:

    ∀t ≤ max(t_0, t_1) :
        ∀p ∈ Π_0 : H_A(p, t) = Π_1
        ∀p ∈ Π_1 : H_A(p, t) = Π_0
    ∀t > max(t_0, t_1), ∀p ∈ Π : H_A(p, t) = ∅

Clearly, ◇P satisfies the specification of an Eventually Perfect failure detector.

Until time max(t_0, t_1), R_A is indistinguishable from R_0 for processes in Π_0, and R_A
is indistinguishable from R_1 for processes in Π_1. Thus in run R_A, q_0 decides 0 at
time t_0, while q_1 decides 1 at time t_1. So q_0 and q_1 decide differently in R_A, and
this violates the agreement condition of Consensus. □
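
To make the failure detector history of Case 3 concrete, the following Python sketch builds the partition and the suspicion function H_A. It is illustrative only: the function names are hypothetical, processes are assumed to be numbered 1..n, and times are taken to be integers.

```python
def partition(n: int):
    """Split processes 1..n into Pi_0 (ceil(n/2) processes) and Pi_1 (the rest)."""
    half = (n + 1) // 2                    # integer form of ceil(n/2)
    pi0 = set(range(1, half + 1))
    pi1 = set(range(half + 1, n + 1))
    return pi0, pi1


def suspects_in_RA(p: int, t: int, n: int, t0: int, t1: int) -> set:
    """H_A(p, t): the processes suspected by p at time t in run R_A."""
    pi0, pi1 = partition(n)
    if t <= max(t0, t1):
        # Each side suspects the whole other side, exactly as in R_0 and R_1.
        return pi1 if p in pi0 else pi0
    # After max(t0, t1) nobody is suspected, so eventual strong accuracy holds.
    return set()
```

Since no process crashes in R_A, strong completeness holds vacuously, and the empty suspicion sets after max(t_0, t_1) give eventual strong accuracy, which is why this history is a legal behaviour of ◇P.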

In  the  Appendix,  we  refine  the  result  of  Theorem  23,  by  considering  an  infinite 
hierarchy  of  failure  detectors  ordered  by  the  number  of  mistakes  they  can  make,  and 
showing  exactly  where  in  this  hierarchy  the  majority  requirement  becomes  necessary  for 
solving  Consensus  (this  hierarchy  contains  all  eight  failure  detectors  that  we  defined  in 
Figure  1).  Note  that  Theorem  23  is  also  a  corollary  of  Theorem  4.3  in  [DLS88]  together 
with  Theorem  35. 


7  On  Atomic  Broadcast 

We now consider Atomic Broadcast, another fundamental problem in fault tolerant
distributed computing, and show that our results on Consensus also apply to Atomic
Broadcast. Informally, Atomic Broadcast requires that all correct processes deliver the same
messages in the same order. Formally, Atomic Broadcast is a Reliable Broadcast that
satisfies:

•  Total  order.  If  two  correct  processes  p  and  q  deliver  two  messages  m  and  m',  then 
p  delivers  m  before  m'  if  and  only  if  q  delivers  m  before  m'. 
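
As an illustration of the total order property, the following Python sketch checks it on two recorded delivery sequences. The function name is hypothetical, and the sketch assumes that each process delivers a given message at most once.

```python
def satisfies_total_order(delivered_p: list, delivered_q: list) -> bool:
    """True if p and q deliver their common messages in the same relative order."""
    common = set(delivered_p) & set(delivered_q)
    # Project each sequence onto the messages delivered by both processes.
    order_p = [m for m in delivered_p if m in common]
    order_q = [m for m in delivered_q if m in common]
    return order_p == order_q
```

Comparing the projections is equivalent to checking every pair m, m' delivered by both processes, because without duplicates two sequences over the same set agree on all pairwise orders exactly when they are equal.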

Total  order  and  agreement  ensure  that  all  correct  processes  deliver  the  same  sequence 
of  messages.  Atomic  Broadcast  is  a  powerful  communication  paradigm  for  fault-tolerant 
distributed  computing  [CM84,  CASD85,  BJ87,  PGM89,  BGT90,  GSTC90,  Sch90].  We 
now  show  that  Consensus  and  Atomic  Broadcast  are  equivalent  in  asynchronous  systems 
with  crash  failures.  This  is  shown  by  reducing  each  to  the  other.16  In  other  words,  a 
solution  for  one  automatically  yields  a  solution  for  the  other.  Both  reductions  apply 
to  any  asynchronous  system  (in  particular,  they  do  not  require  the  assumption  of  a 
failure  detector).  This  equivalence  has  important  consequences  regarding  the  solvability 
of  Atomic  Broadcast  in  asynchronous  systems: 

1.  Atomic  Broadcast  cannot  be  solved  with  a  deterministic  algorithm  in  asynchronous 
systems,  even  if  we  assume  that  at  most  one  process  may  fail,  and  it  can  only  fail  by 
crashing.  This  is  because  Consensus  has  no  deterministic  solution  in  such  systems 
[FLP85]. 

2. Atomic Broadcast can be solved using randomization or unreliable failure
detectors in asynchronous systems. This is because Consensus is solvable with these
techniques in such systems (for a survey of randomized Consensus algorithms, see
[CD89]).

Consensus  can  be  easily  reduced  to  Atomic  Broadcast  as  follows.  To  propose  a  value, 
a  process  atomically  broadcasts  it.  To  decide  a  value,  a  process  picks  the  value  of 
the  first  message  that  it  atomically  delivers.17  By  total  order  of  Atomic  Broadcast,  all 
correct  processes  deliver  the  same  first  message.  Hence  they  choose  the  same  value  and 
agreement  of  Consensus  is  satisfied.  The  other  properties  of  Consensus  are  also  easy  to 
verify.  In  the  next  section,  we  reduce  Atomic  Broadcast  to  Consensus. 
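
The reduction of Consensus to Atomic Broadcast just described can be sketched in a few lines of Python. This is an illustration under strong simplifying assumptions: the class AtomicBroadcastLog is a hypothetical stand-in in which a single shared list fixes the delivery order for all processes, so it models the guarantees of Atomic Broadcast rather than implementing them.

```python
class AtomicBroadcastLog:
    """Hypothetical stand-in: one shared list fixes the A-delivery order."""

    def __init__(self):
        self._log = []

    def a_broadcast(self, value):
        self._log.append(value)

    def delivered_by(self, process_id):
        # By total order and agreement, every process sees the same sequence.
        return list(self._log)


def propose(ab: AtomicBroadcastLog, value):
    """To propose a value, a process atomically broadcasts it."""
    ab.a_broadcast(value)


def decide(ab: AtomicBroadcastLog, process_id):
    """Decide the value of the first A-delivered message (None if none yet)."""
    delivered = ab.delivered_by(process_id)
    return delivered[0] if delivered else None
```

In this sketch, if one process proposes 1 and another then proposes 0, both decide 1, the first value in the common delivery order.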


7.1  Reducing  Atomic  Broadcast  to  Consensus 

In Figure 7, we show how to transform any Consensus algorithm into an Atomic
Broadcast algorithm in asynchronous systems. The resulting Atomic Broadcast algorithm
tolerates as many faulty processes as the given Consensus algorithm.

The reduction uses Reliable Broadcast, and repeated (possibly concurrent, but
completely independent) executions of Consensus. Processes disambiguate between these
executions by tagging all the messages pertaining to the kth execution of Consensus with
the number k. Tagging each message with such a number constitutes a minor
modification to any given Consensus algorithm. Informally, the kth execution of Consensus is
used to decide on the kth batch of messages to be atomically delivered. The propose
and decide primitives corresponding to the kth execution of Consensus are denoted by
propose(k, -) and decide(k, -).

16 They are actually equivalent even in asynchronous systems with arbitrary failures. However, the
reduction is more complex and is omitted here.

17 Note that this reduction does not require the assumption of a failure detector.

Our Atomic Broadcast algorithm uses the R-broadcast(m) and R-deliver(m)
primitives of Reliable Broadcast. To avoid possible ambiguities between Atomic Broadcast
and Reliable Broadcast, we say that a process A-broadcasts or A-delivers to refer to a
broadcast or a delivery associated with Atomic Broadcast; and R-broadcasts or R-delivers
to refer to a broadcast or delivery associated with Reliable Broadcast.

When a process intends to A-broadcast a message m, it R-broadcasts m (in Task
1). When a process p R-delivers m, it adds m to the set R_delivered_p (Task 2). Thus,
R_delivered_p contains all the messages submitted for Atomic Broadcast (since the
beginning) that p is currently aware of. When p A-delivers a message m, it adds m to the set
A_delivered_p (in Task 3). Thus, R_delivered_p - A_delivered_p is the set of messages that
were submitted for Atomic Broadcast but not yet A-delivered by p. This set is denoted
by A_undelivered_p. When A_undelivered_p is not empty, p proposes A_undelivered_p as
the next batch of messages to be A-delivered. batch_p(k) denotes the kth batch of messages
that p A-delivers: it is msgSet_p, the set of messages agreed upon by the kth execution of
Consensus, minus A_delivered_p, those messages that p has already A-delivered.18
Process p delivers the messages in batch_p(k) in some deterministic order, e.g., lexicographical
order, that was agreed a priori by all processes. This transformation of Consensus into
Atomic Broadcast is described in Figure 7 as three concurrent and indivisible tasks. The
proof of correctness follows.
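
The three tasks described above can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the algorithm of Figure 7 itself: ConsensusOracle is a hypothetical stand-in for Consensus in which the first proposal for instance k wins, R-broadcast is modelled by the caller invoking r_deliver on each process, and "wait until decide" never blocks because a process always proposes before it reads the decision.

```python
class ConsensusOracle:
    """Hypothetical stand-in for Consensus: first proposal for instance k wins."""

    def __init__(self):
        self._decisions = {}

    def propose(self, k, value):
        self._decisions.setdefault(k, frozenset(value))

    def decide(self, k):
        return self._decisions[k]


class Process:
    def __init__(self, consensus: ConsensusOracle):
        self.consensus = consensus
        self.r_delivered = set()
        self.a_delivered = set()
        self.a_sequence = []      # messages in the order they were A-delivered
        self.k = 0

    def r_deliver(self, m):       # Task 2
        self.r_delivered.add(m)

    def run_task3(self):          # Task 3, repeated while it is enabled
        while self.r_delivered - self.a_delivered:
            self.k += 1
            a_undelivered = self.r_delivered - self.a_delivered
            self.consensus.propose(self.k, a_undelivered)
            msg_set = self.consensus.decide(self.k)
            # Lexicographical order serves as the agreed deterministic order.
            batch = sorted(msg_set - self.a_delivered)
            self.a_sequence.extend(batch)
            self.a_delivered |= set(batch)
```

Two processes that R-deliver the same messages in different orders, or at different times, still end up with the same a_sequence, because every batch is determined by the common decision of the kth Consensus instance.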

Lemma 24: For any two correct processes p and q, and any message m, if m ∈
R_delivered_p then eventually m ∈ R_delivered_q.

Proof: If m ∈ R_delivered_p then p R-delivered m (in Task 2). Since p is correct,
by agreement of Reliable Broadcast q eventually R-delivers m, and inserts m into
R_delivered_q. □

Lemma 25: For any two correct processes p and q, and all k ≥ 1:

1. If p executes propose(k, -), then q eventually executes propose(k, -).

2. If p A-delivers messages in batch_p(k), then q eventually A-delivers messages in
batch_q(k), and batch_p(k) = batch_q(k).

Proof: The proof is by simultaneous induction on (1) and (2). For k = 1, we first
show that if p executes propose(1, -), then q eventually executes propose(1, -). When p
executes propose(1, -), R_delivered_p must contain some message m. By Lemma 24, m is
eventually in R_delivered_q. Since A_delivered_q is initially empty, eventually R_delivered_q -
A_delivered_q ≠ ∅. Thus, q eventually executes Task 3 and propose(1, -).

18 It is possible for a process p to A-deliver a message m before it R-delivers m. This occurs if m was
proposed by another process, and agreed upon by Consensus, before p R-delivers m.

Every process p executes the following:

Initialization:
    R_delivered ← ∅
    A_delivered ← ∅
    k ← 0

To execute A-broadcast(m):                           { Task 1 }
    R-broadcast(m)

A-deliver(-) occurs as follows:

    when R-deliver(m)                                { Task 2 }
        R_delivered ← R_delivered ∪ {m}

    when R_delivered - A_delivered ≠ ∅               { Task 3 }
        k ← k + 1
        A_undelivered ← R_delivered - A_delivered
        propose(k, A_undelivered)
        wait until decide(k, msgSet)
        batch(k) ← msgSet - A_delivered
        atomically deliver all messages in batch(k) in some deterministic order
        A_delivered ← A_delivered ∪ batch(k)

Figure 7: Using Consensus to solve Atomic Broadcast

We now show that if p A-delivers messages in batch_p(1), then q eventually A-delivers
messages in batch_q(1), and batch_p(1) = batch_q(1). From the algorithm, if p A-delivers
messages in batch_p(1), it previously executed propose(1, -). From part (1) of the lemma,
all correct processes eventually execute propose(1, -). By termination and uniform
integrity of Consensus, every correct process eventually executes decide(1, -) and it does
so exactly once. By agreement of Consensus, all correct processes eventually execute
decide(1, msgSet) with the same msgSet. Since A_delivered_p and A_delivered_q are
initially empty, batch_p(1) = batch_q(1) = msgSet_p = msgSet_q.

Now assume that the lemma holds for all k, 1 ≤ k < l. We first show that if
p executes propose(l, -), then q eventually executes propose(l, -). When p executes
propose(l, -), R_delivered_p must contain some message m that is not in A_delivered_p.
Thus, m is not in ∪_{k=1}^{l-1} batch_p(k). By the induction hypothesis, batch_p(k) = batch_q(k)
for all 1 ≤ k ≤ l - 1. So m is not in ∪_{k=1}^{l-1} batch_q(k). Since m is in R_delivered_p,
by Lemma 24, m is eventually in R_delivered_q. Thus, there is a time after q A-delivers
batch_q(l - 1) such that there is a message in R_delivered_q - A_delivered_q. So q eventually
executes Task 3 and propose(l, -).

We now show that if p A-delivers messages in batch_p(l), then q A-delivers messages
in batch_q(l), and batch_p(l) = batch_q(l). Since p A-delivers messages in batch_p(l), it must
have executed propose(l, -). By part (1) of this lemma, all correct processes eventually
execute propose(l, -). By termination and uniform integrity of Consensus, every correct
process eventually executes decide(l, -) and it does so exactly once. By agreement
of Consensus, all correct processes eventually execute decide(l, msgSet) with the same
msgSet. Note that batch_p(l) = msgSet - ∪_{k=1}^{l-1} batch_p(k), and batch_q(l) = msgSet -
∪_{k=1}^{l-1} batch_q(k). By the induction hypothesis, batch_p(k) = batch_q(k) for all 1 ≤ k ≤ l - 1.
Thus, batch_p(l) = batch_q(l). □

Lemma  26:  The  algorithm  in  Figure  7  satisfies  agreement  and  total  order. 

PROOF: Immediate from Lemma 25, and the fact that correct processes A-deliver
messages in each batch in the same deterministic order. □

Lemma  27:  (Validity)  If  a  correct  process  A-broadcasts  m,  then  all  correct  processes 
eventually  A-deliver  m. 

Proof:  The  proof  is  by  contradiction.  Suppose  some  correct  process  p  A-broadcasts 
m,  and  some  correct  process  never  A-delivers  m.  By  Lemma  26,  no  correct  process 
A-delivers  m. 

By Task 1 of Figure 7, p R-broadcasts m. By validity of Reliable Broadcast, every
correct process q eventually R-delivers m, and inserts m in R_delivered_q (Task 2). Since
correct processes never A-deliver m, they never insert m in A_delivered. Thus, for
every correct process q, there is a time after which m is permanently in R_delivered_q -
A_delivered_q. From Figure 7 and Lemma 25, there is a k_1, such that for all l ≥ k_1, all
correct processes execute propose(l, -), and they do so with sets that always include m.

Since all faulty processes eventually crash, there is a k_2 such that no faulty process
executes propose(l, -) with l ≥ k_2. Let k = max(k_1, k_2). Since all correct processes
execute propose(k, -), by termination and agreement of Consensus, all correct processes
execute decide(k, msgSet) with the same msgSet. By uniform validity of Consensus,
some process q executed propose(k, msgSet). From our definition of k, q is correct
and msgSet contains m. Thus all correct processes A-deliver m, a contradiction that
concludes the proof. □

Lemma  28:  (Uniform  integrity)  For  any  message  m,  each  process  A-delivers  m  at  most 
once,  and  only  if  m  was  A-broadcast  by  some  process. 

Proof: Suppose a process p A-delivers m. After p A-delivers m, it inserts m in
A_delivered_p. From the algorithm, it is clear that p cannot A-deliver m again.

From the algorithm, p executed decide(k, msgSet) for some k and some msgSet that
contains m. By uniform validity of Consensus, some process q must have executed
propose(k, msgSet). So q previously R-delivered all the messages in msgSet, including
m. By uniform integrity of Reliable Broadcast, some process r R-broadcast m. So, r
A-broadcast m. □

Theorem  29:  Consider  any  system  (synchronous  or  asynchronous)  subject  to  crash 
failures  and  where  Reliable  Broadcast  can  be  implemented.  The  algorithm  in  Figure  7 
transforms  any  algorithm  for  Consensus  into  an  Atomic  Broadcast  algorithm. 

Proof:  Immediate  from  Lemmata  26,  27,  and  28.  □ 

Since  Reliable  Broadcast  can  be  implemented  in  asynchronous  systems  with  crash 
failures  (Section  4),  the  above  theorem  shows  that  Atomic  Broadcast  is  reducible  to 
Consensus  in  those  systems.  As  we  argued  earlier,  the  converse  is  also  true.  Thus: 

Corollary 30: Consensus and Atomic Broadcast are equivalent in asynchronous
systems with crash failures.

The equivalence of Consensus and Atomic Broadcast in asynchronous systems
immediately implies that our results regarding Consensus (in particular Corollaries 15 and 22,
and Theorem 23) also hold for Atomic Broadcast:

Corollary 31: Given any Weak failure detector W, Atomic Broadcast is solvable in
asynchronous systems with f < n.

Corollary 32: ◇W is the weakest failure detector for solving Atomic Broadcast in an
asynchronous system with f < ⌈n/2⌉.

Corollary 33: There is an Eventually Perfect failure detector ◇P such that there is
no algorithm A which solves Atomic Broadcast using ◇P in asynchronous systems with
f ≥ ⌈n/2⌉.

Furthermore, Theorem 29 shows that by “plugging in” any randomized Consensus
algorithm (such as the ones in [CD89]) into the algorithm of Figure 7, we automatically get
a randomized algorithm for Atomic Broadcast in asynchronous systems.

Corollary 34: Atomic Broadcast can be solved by randomized algorithms in
asynchronous systems with f < ⌈n/2⌉ crash failures.

8  Related  work 

8.1  Partial  synchrony 

Fischer, Lynch and Paterson showed that Consensus cannot be solved in an asynchronous
system subject to crash failures [FLP85]. The fundamental reason why Consensus
cannot be solved in completely asynchronous systems is the fact that, in such systems, it
is impossible to reliably distinguish a process that has crashed from one that is merely
very slow. In other words, Consensus is unsolvable because accurate failure detection is
impossible. On the other hand, it is well-known that Consensus is solvable
(deterministically) in completely synchronous systems, that is, systems where clocks are perfectly
synchronised, all processes take steps at the same rate and each message arrives at its
destination a fixed and known amount of time after it is sent. In such a system we can
use timeouts to implement a “perfect” failure detector, i.e., one in which no process
is ever wrongly suspected, and every faulty process is eventually suspected. Thus, the
ability to solve Consensus in a given system is intimately related to the failure detection
capabilities of that system. This realisation led us to augment the asynchronous model
of computation with unreliable failure detectors as described in this paper.

A  different  tack  on  circumventing  the  unsolvability  of  Consensus  is  pursued  in  [DDS87] 
and  [DLS88].  The  approach  of  those  papers  is  based  on  the  observation  that  between 
the  completely  synchronous  and  completely  asynchronous  models  of  distributed  systems 
there  lie  a  variety  of  intermediate  “partially  synchronous”  models. 

In particular, [DDS87] defines a space of 32 models by considering five key
parameters, each of which admits a “favourable” and an “unfavourable” setting. For instance,
one of the parameters is whether the maximum message delay is bounded and known
(favourable setting) or unbounded (unfavourable setting). Each of the 32 models
corresponds to a particular setting of the 5 parameters. [DDS87] identifies four “minimal”
models in which Consensus is solvable. These are minimal in the sense that the
weakening of any parameter from favourable to unfavourable would yield a model of partial
synchrony where Consensus is unsolvable. Thus, within the space of the models
considered, [DDS87] delineates precisely the boundary between solvability and unsolvability
of Consensus, and provides an answer to the question “What is the least amount of
synchrony sufficient to solve Consensus?”.

Every process p executes the following:

output_p ← ∅
for all q ∈ Π                   {Δ_p(q) denotes the duration of p’s time-out interval for q}
    Δ_p(q) ← default time-out interval

cobegin

|| Task 1: repeat periodically
    send “p-is-alive” message to all

|| Task 2: repeat periodically
    for all q ∈ Π
        if q ∉ output_p and p did not receive “q-is-alive” in the last Δ_p(q) seconds
            output_p ← output_p ∪ {q}
            {p times-out on q: it now suspects q has crashed}

|| Task 3: when receive “q-is-alive” for some q
    if q ∈ output_p             {p knows that it prematurely timed-out on q:}
        output_p ← output_p - {q}       {1. p repents on q, and}
        Δ_p(q) ← Δ_p(q) + 1             {2. p increases its time-out period for q}

coend

Figure 8: A time-out based implementation of ◇P in some models of partial synchrony.

[DLS88]  considers  the  following  two  models  of  partial  synchrony.  The  first  model 
assumes  that  there  are  bounds  on  relative  process  speeds  and  on  message  transmission 
times,  but  these  bounds  are  not  known.  The  second  model  assumes  that  these  bounds 
are  known,  but  they  hold  only  after  some  unknown  time. 

In each one of these two models (with crash failures), it is easy to implement an
Eventually Perfect failure detector ◇P. In fact, we can implement ◇P in an even weaker
model of partial synchrony: one in which there are bounds on message transmission
times and relative process speeds, but these bounds are not known and they hold only
after some unknown time. Since ◇P is stronger than ◇W, by Corollaries 20 and 32,
this implementation immediately gives Consensus and Atomic Broadcast solutions for
this model of partial synchrony and, a fortiori, for the two models of [DLS88]. The
implementation of ◇P is given in Figure 8, and proven below.

Each process p periodically sends a “p-is-alive” message to all the processes. If p does
not receive a “q-is-alive” message from some process q for $\Delta_p(q)$ units of time, p adds
q to its list of suspects. If p receives “q-is-alive” from some process q that it currently
suspects, p knows that its previous time-out on q was premature. In this case, p removes
q from its list of suspects and increases the length of the time-out.
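
The bookkeeping that one process p performs in Tasks 2 and 3 of Figure 8 can be sketched in Python as follows. This is only an illustrative sketch: the class name, the method names, and the choice to pass the current time in explicitly (rather than read a clock) are assumptions made here, and Task 1 (sending “p-is-alive” messages) is left out.

```python
from typing import Dict, Hashable, Iterable, Set


class TimeoutFailureDetector:
    """Suspect list kept by one process p (hypothetical helper, not from the paper)."""

    def __init__(self, peers: Iterable[Hashable], default_timeout: float = 1.0) -> None:
        # Per-peer time-out period, initialised to the default interval.
        self.timeout: Dict[Hashable, float] = {q: default_timeout for q in peers}
        # Time at which the last "q-is-alive" message was received.
        self.last_heard: Dict[Hashable, float] = {q: 0.0 for q in self.timeout}
        # Processes that p currently suspects.
        self.output: Set[Hashable] = set()

    def check(self, now: float) -> None:
        """Task 2: suspect every q that has been silent for longer than its time-out."""
        for q, delta in self.timeout.items():
            if q not in self.output and now - self.last_heard[q] > delta:
                self.output.add(q)

    def on_alive(self, q: Hashable, now: float) -> None:
        """Task 3: a "q-is-alive" message arrived at time `now`."""
        self.last_heard[q] = now
        if q in self.output:
            self.output.discard(q)  # 1. p repents on q
            self.timeout[q] += 1    # 2. p increases its time-out period for q


fd = TimeoutFailureDetector(peers=["q", "r"], default_timeout=2.0)
fd.on_alive("q", now=1.0)
fd.on_alive("r", now=1.0)
fd.check(now=2.5)           # nobody has been silent for more than 2 time units
fd.on_alive("r", now=3.0)
fd.check(now=3.5)           # q has been silent for 2.5 > 2 time units: p suspects q
suspected_after_timeout = set(fd.output)
fd.on_alive("q", now=4.0)   # late message: p repents and enlarges q's time-out
```

Passing `now` explicitly keeps the sketch deterministic; a real deployment would presumably drive `check` from a periodic timer.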

Theorem 35: Consider a system in which, after some time t, some bounds on relative
process speeds and on message transmission times hold (we do not assume that t or the
values of these bounds are known). The algorithm in Figure 8 implements an Eventually
Perfect failure detector $\Diamond\mathcal{P}$ in this system.

Proof: (sketch) We first show that strong completeness holds, i.e., eventually every
process that crashes is permanently suspected by every correct process. Suppose a
process q crashes. Clearly, q eventually stops sending “q-is-alive” messages, and there is a
time after which no correct process receives such a message. Thus, there is a time t' after
which: (1) all correct processes time-out on q (Task 2), and (2) they do not receive any
message from q after this time-out. From the algorithm, it is clear that after time t', all
correct processes will permanently suspect q. Thus, strong completeness is satisfied.

We  now  show  that  eventual  strong  accuracy  is  satisfied.  That  is,  for  any  correct 
processes  p  and  q,  there  is  a  time  after  which  p  will  not  suspect  q.  There  are  two  possible 
cases: 

1. Process p times-out on q finitely often (in Task 2). Since q is correct and keeps
sending “q-is-alive” messages forever, eventually p receives one such message after
its last time-out on q. At this point, q is permanently removed from p’s list of
suspects (Task 3).

2. Process p times-out on q infinitely often (in Task 2). Note that p times-out on q
(and so p adds q to $output_p$) only if q is not already in $output_p$. Thus, q is added to
and removed from $output_p$ infinitely often. Process q is only removed from $output_p$
in Task 3, and every time this occurs the time-out period $\Delta_p(q)$ is increased. Since
this occurs infinitely often, $\Delta_p(q)$ grows unbounded. Thus, eventually (1) the
bounds on relative process speeds and on message transmission times hold, and
(2) $\Delta_p(q)$ is larger than the correct time-out based on these bounds. After this
point, p cannot time-out on q any more, which contradicts our assumption that
p times-out on q infinitely often. Thus Case 2 cannot occur. □
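
The accuracy argument can be illustrated with a small simulation. The sketch below makes strong simplifying assumptions that are not part of the proof: time is discrete, the bounds hold from time 0, and the “q-is-alive” messages of a correct process q reach p exactly every `gap` ticks, where `gap` is unknown to p. The function name and its parameters are hypothetical.

```python
from typing import Tuple


def count_premature_timeouts(gap: int, initial_timeout: int = 1,
                             horizon: int = 1000) -> Tuple[int, int]:
    """Simulate p watching a correct q whose messages arrive every `gap` ticks.

    Returns (number of premature time-outs, final time-out period).
    """
    timeout = initial_timeout
    last_heard = 0
    suspected = False
    mistakes = 0
    for now in range(1, horizon + 1):
        if now % gap == 0:
            # Task 3: a "q-is-alive" message arrives.
            last_heard = now
            if suspected:
                suspected = False   # p repents on q
                timeout += 1        # and increases its time-out period
        if not suspected and now - last_heard > timeout:
            # Task 2: p times-out on q although q is correct.
            suspected = True
            mistakes += 1
    return mistakes, timeout


result_gap_5 = count_premature_timeouts(gap=5)
result_gap_10 = count_premature_timeouts(gap=10)
```

In this setting the time-out period stops growing once it reaches `gap - 1`, after which p never times-out on q again; extending `horizon` does not add further mistakes.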

Thus, failure detectors can be viewed as a more abstract and modular way of incorporating
partial synchrony assumptions into the model of computation. Instead of focusing
on  the  operational  features  of  partial  synchrony  (such  as  the  five  parameters  considered 
in  [DDS87]),  we  can  consider  the  axiomatic  properties  that  failure  detectors  must  have 
in  order  to  solve  Consensus.  The  problem  of  implementing  a  given  failure  detector  in 
a  specific  model  of  partial  synchrony  becomes  a  separate  issue;  this  separation  affords 
greater  modularity. 

Studying  failure  detectors  rather  than  various  models  of  partial  synchrony  has  other 
advantages  as  well.  By  showing  that  Consensus  is  solvable  using  some  specific  failure 
detector we thereby show that Consensus is solvable in all systems in which that failure
detector can be implemented. An algorithm that relies on the axiomatic properties of
a  given  failure  detector  is  more  general,  more  modular,  and  simpler  to  understand  than 
one  that  relies  directly  on  specific  operational  features  of  partial  synchrony  (that  can  be 
used  to  implement  the  given  failure  detector). 

From this more abstract point of view, the question “What is the least amount of
synchrony sufficient to solve Consensus?” translates to “What is the weakest failure
detector sufficient to solve Consensus?”. In contrast to [DDS87], which identified a
set of minimal models of partial synchrony in which Consensus is solvable, in [CHT92]
we are able to exhibit a single minimum failure detector, $\Diamond\mathcal{W}$, that can be used to
solve Consensus. The technical device that made this possible is the notion of reduction
between failure detectors.

8.2 The application of failure detection in shared memory systems

Loui  and  Abu-Amara  showed  that  in  an  asynchronous  shared  memory  system  with 
atomic read/write registers, Consensus cannot be solved even if at most one process
may  crash  [LA87].  This  raises  the  following  natural  question:  can  we  circumvent  this 
impossibility  result  using  unreliable  failure  detectors?  In  a  recent  work,  Lo  shows  that 
this  is  indeed  possible  [Lo93].  In  particular,  he  shows  that  using  a  Strong  failure  detector 
and  atomic  registers,  one  can  solve  Consensus  for  any  number  of  failures.  He  also  shows 
that  for  systems  with  a  majority  of  correct  processes,  it  is  sufficient  to  use  an  Eventually 
Strong  failure  detector  and  atomic  registers. 

8.3  The  Isis  toolkit 

With  our  approach,  even  if  a  correct  process  p  is  repeatedly  suspected  to  have  crashed 
by  the  other  processes,  it  is  still  required  to  behave  like  every  other  correct  process  in 
the  system.  For  example,  with  Atomic  Broadcast,  p  is  still  required  to  A-deliver  the 
same  messages,  in  the  same  order,  as  all  the  other  correct  processes.  Furthermore,  p  is 
not prevented from A-broadcasting messages, and these messages must eventually be A-delivered
by all correct processes (including those processes whose local failure detector
modules  permanently  suspect  p  to  have  crashed).  In  summary,  application  programs  that 
use  unreliable  failure  detection  are  aware  that  the  information  they  get  from  the  failure 
detector  may  be  incorrect:  they  only  take  this  information  as  an  imperfect  “hint”  about 
which  processes  have  really  crashed.  Furthermore,  processes  are  never  “discriminated 
against”  if  they  are  falsely  suspected  to  have  crashed. 

Isis  takes  an  alternative  approach  based  on  the  assumption  that  failure  detectors 
rarely make mistakes [RB91]. In those cases in which a correct process p is falsely
suspected by the failure detector, p is effectively forced “to crash” (via a group membership
protocol that removes p from all the groups that it belongs to). An application using
such a failure detector cannot distinguish between a faulty process that really crashed,
and a correct one that was forced to do so. Essentially, the Isis failure detector forces the
system  to  conform  to  its  view.  From  the  application’s  point  of  view,  this  failure  detector 
looks  “perfect”:  it  never  makes  visible  mistakes. 

For the Isis approach to work, the low-level time-outs used to detect crashes must be
set very conservatively: premature time-outs are costly (each results in the removal of
a process), and too many of them can lead to system shutdown.19 In contrast, with our
approach, premature time-outs (i.e., failure detector mistakes) are not so deleterious:
they can only delay an application. In other words, premature time-outs can affect the
liveness but not the safety of an application. For example, consider the Atomic Broadcast
algorithm that uses $\Diamond\mathcal{W}$. If the failure detector “malfunctions”, some messages may
be delayed, but no message is ever delivered out of order, and no correct process is
removed. If the failure detector stops malfunctioning, outstanding messages are eventually
delivered. Thus, we can set time-out periods more aggressively than Isis: in practice,
we would set our failure detector time-out periods closer to the average case, while Isis
must set time-outs to the worst-case.

8.4  Other  work 

Several  works  in  fault-tolerant  computing  used  time-outs  primarily  or  exclusively  for  the 
purpose  of  failure  detection.  An  example  of  this  approach  is  given  by  an  algorithm  in 
[ADLS91],  which,  as  pointed  out  by  the  authors,  “can  be  viewed  as  an  asynchronous 
algorithm that uses a fault detection (e.g., timeout) mechanism.”

Appendix: A hierarchy of failure detectors and bounds on fault-tolerance

In  the  preceding  sections,  we  introduced  the  concept  of  unreliable  failure  detectors  that 
could  make  mistakes,  and  showed  how  to  use  them  to  solve  Consensus  despite  such 
mistakes.  Informally,  a  mistake  occurs  when  a  correct  process  is  erroneously  added  to 
the  list  of  processes  that  are  suspected  to  have  crashed.  In  this  Appendix,  we  formalise 
this concept and study a related property that we call repentance. Informally, if a process
p learns that its failure detector module $\mathcal{D}_p$ made a mistake, repentance requires $\mathcal{D}_p$ to
take corrective action. Based on mistakes and repentance, we define a hierarchy of failure
detector  specifications  that  will  be  used  to  unify  some  of  our  results,  and  to  refine  the 
lower  bound  on  fault-tolerance  given  in  Section  6.3.  This  infinite  hierarchy  consists  of  a 
continuum  of  repentant  failure  detectors  ordered  by  the  maximum  number  of  mistakes 
that  each  one  can  make. 

Mistakes  and  Repentance 

We now define a mistake. Let $R = (F, H, I, S, T)$ be any run using a failure detector $\mathcal{D}$.
$\mathcal{D}$ makes a mistake in R at time t on process p about process q if at time t, p begins to
suspect that q has crashed even though $q \notin F(t)$. Formally:

$[q \notin F(t),\ q \in H(p, t)]$ and $[\exists t' < t,\ \forall t'',\ t' \le t'' < t : q \notin H(p, t'')]$

Such a mistake is denoted by the tuple $\langle R, p, q, t \rangle$. The set of mistakes made by $\mathcal{D}$ in R
is denoted by $M(R)$.

Note that only the erroneous addition of q into $\mathcal{D}_p$ is counted as a mistake on p. The
continuous retention of q in $\mathcal{D}_p$ does not count as additional mistakes. Thus, a failure
detector can make multiple mistakes on a process p about another process q only by
repeatedly adding and then removing q from the set $\mathcal{D}_p$. In practice, mistakes are caused
by premature time-outs.
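
To make the counting rule concrete, the following sketch lists the mistakes that one process p makes about one process q. It assumes discrete time 0, 1, 2, ..., so that “p begins to suspect q at time t” reduces to q being suspected at t but not at t − 1; the function name and the list-of-sets encoding of H(p, ·) and F(·) are choices made for this illustration only.

```python
from typing import List, Set


def mistakes_about(history: List[Set[str]], crashed: List[Set[str]], q: str) -> List[int]:
    """Times t at which p begins to suspect q although q has not crashed by time t.

    history[t] plays the role of H(p, t); crashed[t] plays the role of F(t).
    """
    times = []
    for t in range(1, len(history)):
        begins_to_suspect = q in history[t] and q not in history[t - 1]
        if begins_to_suspect and q not in crashed[t]:
            times.append(t)
    return times


# p suspects q during [1, 2], repents at 3, and suspects q again from 4 on;
# q actually crashes at time 5.
H_p = [set(), {"q"}, {"q"}, set(), {"q"}, {"q"}]
F = [set(), set(), set(), set(), set(), {"q"}]
mistake_times = mistakes_about(H_p, F, "q")
```

Only the two additions of q (at times 1 and 4) are counted; keeping q in the suspect list at times 2 and 5 adds nothing.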

We define the following four types of accuracy properties for a failure detector $\mathcal{D}$
based on the mistakes made by $\mathcal{D}$:

• Strongly k-mistaken: $\mathcal{D}$ makes at most k mistakes. Formally, $\mathcal{D}$ is strongly
k-mistaken if:

$\forall R$ using $\mathcal{D}$ : $|M(R)| \le k$

• Weakly k-mistaken: There is a correct process p such that $\mathcal{D}$ makes at most k
mistakes about p. Formally, $\mathcal{D}$ is weakly k-mistaken if:

$\forall R = (F, H, I, S, T)$ using $\mathcal{D}$, $\exists p \in correct(F)$ :

$|\{\langle R, q, p, t \rangle : \langle R, q, p, t \rangle \in M(R)\}| \le k$

• Strongly finitely mistaken: $\mathcal{D}$ makes a finite number of mistakes. Formally, $\mathcal{D}$ is
strongly finitely mistaken if:

$\forall R$ using $\mathcal{D}$ : $M(R)$ is finite.

In this case, it is clear that there is a time t after which $\mathcal{D}$ stops making mistakes.

• Weakly finitely mistaken: There is a correct process p such that $\mathcal{D}$ makes a finite
number of mistakes about p. Formally, $\mathcal{D}$ is weakly finitely mistaken if:

$\forall R = (F, H, I, S, T)$ using $\mathcal{D}$, $\exists p \in correct(F)$ :

$\{\langle R, q, p, t \rangle : \langle R, q, p, t \rangle \in M(R)\}$ is finite.

In this case, there is a time t after which $\mathcal{D}$ stops making mistakes about p.

For  most  values  of  k,  the  properties  mentioned  above  are  not  powerful  enough  to  be 
useful.  For  example,  suppose  every  process  permanently  suspects  every  other  process. 
In this case, the failure detector makes at most $(n - 1)^2$ mistakes, but it is clearly useless
since  it  does  not  provide  any  information. 

The  core  of  this  problem  is  that  such  failure  detectors  are  not  forced  to  reverse  a 
mistake,  even  when  a  mistake  becomes  “obvious”  (say,  after  a  process  q  replies  to  an 
inquiry  that  was  sent  to  q  after  q  was  suspected  to  have  crashed).  However,  we  can  impose 
a  natural  requirement  to  circumvent  this  problem.  Consider  the  following  scenario.  The 
failure detector module at process p erroneously adds q to $\mathcal{D}_p$ at time t. Subsequently, p
sends a message to q and receives a reply. This reply is a proof that q had not crashed
at time t. Thus, p knows that its failure detector module made a mistake about q. It
is reasonable to require that, given such irrefutable evidence of a mistake, the failure
detector module at p takes the corrective action of removing q from $\mathcal{D}_p$. In general, we
can  require  the  following  property: 

• Repentance: If a correct process p eventually knows that $q \notin F(t)$, then at some
time after t, $q \notin \mathcal{D}_p$. Formally, $\mathcal{D}$ is repentant if:

$\forall R = (F, H, I, S, T)$ using $\mathcal{D}$, $\forall t$, $\forall p, q \in \Pi$ :

$[\exists t' : (R, t') \models K_p(q \notin F(t))] \Rightarrow [\exists t'' \ge t : q \notin H(p, t'')]$

The knowledge-theoretic operator $K_p$ can be defined formally [HM90]. Informally, $(R, t) \models \phi$
iff in run R at time t, predicate $\phi$ holds. We say $(R, t) \sim_p (R', t')$ iff the run R at
time t and the run R' at time t' are indistinguishable to p. Finally, $(R, t) \models K_p(\phi) \iff
[\forall (R', t') \sim_p (R, t) : (R', t') \models \phi]$. For a detailed treatment of Knowledge Theory as
applied to distributed systems, the reader should refer to the seminal work done in
[MDH86, HM90].

Recall that in Section 2.2 we defined a failure detector to be a function that maps each
failure pattern to a set of failure detector histories. Thus, the specification of a failure
detector depends solely on the failure pattern actually encountered. In contrast, the
definition of repentance depends on the knowledge (about mistakes) at each process. This in
turn depends on the algorithm being executed, and the communication pattern actually
encountered. Thus, repentant failure detectors cannot be specified solely in terms of the
failure pattern actually encountered. Nevertheless, repentance is an important property
that we would like many failure detectors to satisfy.

In  the  rest  of  this  Appendix,  we  informally  define  a  hierarchy  of  repentant  failure 
detectors  that  differ  by  their  accuracy  (i.e.,  the  maximum  number  of  mistakes  they  can 
make).  As  we  just  noted,  such  failure  detectors  cannot  be  specified  solely  in  terms  of 
the  failure  pattern  actually  encountered,  and  thus  they  do  not  fit  the  formal  definition 
of  failure  detectors  given  in  Section  2.2. 

A hierarchy of repentant failure detectors

We  now  define  an  infinite  hierarchy  of  repentant  failure  detectors.  Every  failure 
detector  in  this  hierarchy  satisfies  weak  completeness,  repentance,  and  one  of  the  four 
types  of  accuracy  that  we  defined  in  the  previous  section.  We  name  these  failure  detectors 
after  the  accuracy  property  that  they  satisfy: 

• $\mathcal{SF}(k)$ denotes a Strongly k-Mistaken failure detector,

• $\mathcal{SF}$ denotes a Strongly Finitely Mistaken failure detector,

• $\mathcal{WF}(k)$ denotes a Weakly k-Mistaken failure detector, and

• $\mathcal{WF}$ denotes a Weakly Finitely Mistaken failure detector.

Clearly, $\mathcal{SF}(0) \succeq \mathcal{SF}(1) \succeq \ldots \succeq \mathcal{SF}(k) \succeq \mathcal{SF}(k+1) \succeq \ldots \succeq \mathcal{SF}$. A similar order
holds for the $\mathcal{WF}$s. Consider a system of n processes of which at most f may crash. In
this system, there are at least $n - f$ correct processes. Since $\mathcal{SF}((n - f) - 1)$ makes fewer
mistakes than the number of correct processes, there is at least one correct process that
it never suspects. Thus, $\mathcal{SF}((n - f) - 1)$ is weakly 0-mistaken, and $\mathcal{SF}((n - f) - 1) \succeq
\mathcal{WF}(0)$. Furthermore, it is clear that $\mathcal{SF} \succeq \mathcal{WF}$. This infinite hierarchy of failure
detectors, ordered by reducibility, is illustrated in Figure 9 (where an edge → denotes
the $\succeq$ relation).

Each  of  the  eight  failure  detectors  that  we  considered  in  Section  2.4  is  equivalent  to 
some  failure  detector  in  this  hierarchy.  In  particular,  it  is  easy  to  show  that: 

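Observation 36:

• $\mathcal{P} \cong \mathcal{Q} \cong \mathcal{SF}(0)$,

• $\mathcal{S} \cong \mathcal{W} \cong \mathcal{WF}(0)$,

• $\Diamond\mathcal{P} \cong \Diamond\mathcal{Q} \cong \mathcal{SF}$, and

• $\Diamond\mathcal{S} \cong \Diamond\mathcal{W} \cong \mathcal{WF}$.

[Figure 9 (diagram). Node labels and annotations recoverable from the figure:
  $\mathcal{SF}(0) \cong \mathcal{P} \cong \mathcal{Q}$ (strongest): Consensus solvable for all $f < n$
  $\mathcal{SF}(1)$: Consensus solvable iff $f < n$
  $\mathcal{SF}(2)$: Consensus solvable iff $f < n - 1$
  $\mathcal{SF}(n - f - 1)$, with an edge to $\mathcal{WF}(0)$
  $\mathcal{WF}(0)$: Consensus solvable for all $f < n$
  $\mathcal{SF}(\lfloor n/2 \rfloor - 1)$: Consensus solvable iff $f < \lceil n/2 \rceil + 2$
  $\mathcal{SF}(\lfloor n/2 \rfloor)$: Consensus solvable iff $f < \lceil n/2 \rceil + 1$
  lower part of the hierarchy, down to $\mathcal{WF} \cong \Diamond\mathcal{S} \cong \Diamond\mathcal{W}$ (weakest): Consensus solvable iff $f < \lceil n/2 \rceil$]

Figure 9: The hierarchy of repentant failure detectors ordered by reducibility. This figure
also shows the maximum number of faulty processes for which Consensus can be solved
using each failure detector in this hierarchy.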

For example, it is easy to see that the reduction algorithm in Figure 3 transforms $\mathcal{WF}$
into $\Diamond\mathcal{W}$. Other conversions are similar or straightforward and are therefore omitted.
Note that $\mathcal{P}$ and $\Diamond\mathcal{W}$ are the strongest and weakest failure detectors in this hierarchy,
respectively. From Corollaries 15 and 31, and Observation 36 we have:

Corollary 37: Given $\mathcal{WF}(0)$, Consensus and Atomic Broadcast are solvable in
asynchronous systems with $f < n$.

Similarly, from Corollaries 20 and 32, and Observation 36 we have:

Corollary 38: Given $\mathcal{WF}$, Consensus and Atomic Broadcast are solvable in
asynchronous systems with $f < \lceil n/2 \rceil$.

Tight  bounds  on  fault-tolerance 

Since  Consensus  and  Atomic  Broadcast  are  equivalent  in  asynchronous  systems  with  any 
number  of  faulty  processes  (Theorem  30),  we  can  focus  on  establishing  fault-tolerance 
bounds for Consensus. In Section 6, we showed that failure detectors with perpetual
accuracy (i.e., $\mathcal{P}$, $\mathcal{Q}$, $\mathcal{S}$, or $\mathcal{W}$) can be used to solve Consensus in asynchronous systems
with any number of failures. In contrast, with failure detectors with eventual accuracy
(i.e., $\Diamond\mathcal{P}$, $\Diamond\mathcal{Q}$, $\Diamond\mathcal{S}$, or $\Diamond\mathcal{W}$), Consensus can be solved if and only if a majority of the
processes are correct. We now refine this result by considering each failure detector $\mathcal{D}$ in
our infinite hierarchy of failure detectors, and determining how many correct processes
are necessary to solve Consensus using $\mathcal{D}$. The results are illustrated in Figure 9.

There are two cases depending on whether we assume that the system has a majority
of correct processes or not. Since $\Diamond\mathcal{W}$, the weakest failure detector in the hierarchy, can
be used to solve Consensus when a majority of the processes are correct, we have:

Observation 39: If $f < n/2$ then Consensus can be solved using any failure detector in
the hierarchy of Figure 9.

We  now  consider  the  solvability  of  Consensus  in  systems  that  do  not  have  a  majority  of 
correct processes. For these systems, we determine the maximum m for which Consensus
is solvable using $\mathcal{SF}(m)$ or $\mathcal{WF}(m)$. We first show that Consensus is solvable using
$\mathcal{SF}(m)$ if and only if m, the number of mistakes, is less than or equal to $n - f$, the
number of correct processes. We then show that Consensus is solvable using $\mathcal{WF}(m)$ if
and only if $m = 0$.

Theorem 40: Suppose $f \ge n/2$. If $m > n - f$ then there is a Strongly m-Mistaken failure
detector $\mathcal{SF}(m)$ such that there is no algorithm A which solves Consensus using $\mathcal{SF}(m)$
in asynchronous systems.

Proof: (sketch) We describe the behaviour of a Strongly m-Mistaken failure detector
$\mathcal{SF}(m)$ such that with every algorithm A, there is a run $R_A$ of A using $\mathcal{SF}(m)$ that does
not satisfy the specification of Consensus. Since $1 \le n - f \le n/2$, we can partition the
processes into three sets $\Pi_0$, $\Pi_1$ and $\Pi_{crashed}$, such that $\Pi_0$ and $\Pi_1$ are non-empty sets
containing $n - f$ processes each, and $\Pi_{crashed}$ is a (possibly empty) set containing the
remaining $n - 2(n - f)$ processes. For the rest of this proof, we will only consider runs
in which all processes in $\Pi_{crashed}$ crash in the beginning of the run. Let $q_0 \in \Pi_0$ and
$q_1 \in \Pi_1$. Consider any Consensus algorithm A, and the following two runs of A using
$\mathcal{SF}(m)$:

• Run $R_0 = (F_0, H_0, I, S_0, T_0)$: All processes in $\Pi_0$ propose 0, and all processes
in $\Pi_1 \cup \Pi_{crashed}$ propose 1. All processes in $\Pi_0$ are correct in $F_0$, while all the
f processes in $\Pi_1 \cup \Pi_{crashed}$ crash in $F_0$ at the beginning of the run, i.e., $\forall t \in
\mathcal{T} : F_0(t) = \Pi_1 \cup \Pi_{crashed}$. Process $q_0$ permanently suspects every process in
$\Pi_1 \cup \Pi_{crashed}$, i.e., $\forall t \in \mathcal{T} : H_0(q_0, t) = \Pi_1 \cup \Pi_{crashed} = F_0(t)$. No other process
suspects any process, i.e., $\forall t \in \mathcal{T}, \forall q \ne q_0 : H_0(q, t) = \emptyset$. In this run, it is clear
that $\mathcal{SF}(m)$ satisfies the specification of a Strongly m-Mistaken failure detector.

• Run $R_1 = (F_1, H_1, I, S_1, T_1)$: As in $R_0$, all processes in $\Pi_0$ propose 0, and all
processes in $\Pi_1 \cup \Pi_{crashed}$ propose 1. All processes in $\Pi_1$ are correct in $F_1$, while
all the f processes in $\Pi_0 \cup \Pi_{crashed}$ crash in $F_1$ at the beginning of the run, i.e.,
$\forall t \in \mathcal{T} : F_1(t) = \Pi_0 \cup \Pi_{crashed}$. Process $q_1$ permanently suspects every process in
$\Pi_0 \cup \Pi_{crashed}$, and no other process suspects any process. Clearly, $\mathcal{SF}(m)$ satisfies
the specification of a Strongly m-Mistaken failure detector in this run.

Assume, without loss of generality, that both $R_0$ and $R_1$ satisfy the specification of
Consensus. Let $t_0$ be the time at which $q_0$ decides in $R_0$, and let $t_1$ be the time at
which $q_1$ decides in $R_1$. There are three possible cases; in each case we construct a run
$R_A = (F_A, H_A, I_A, S_A, T_A)$ of algorithm A using $\mathcal{SF}(m)$ such that $\mathcal{SF}(m)$ satisfies the
specification of a Strongly m-Mistaken failure detector, but $R_A$ violates the specification
of Consensus.

1. In $R_0$, $q_0$ decides 1. Let $R_A = (F_0, H_0, I_A, S_0, T_0)$ be a run identical to $R_0$ except
that all processes in $\Pi_1 \cup \Pi_{crashed}$ propose 0. Since in $F_0$ the processes in $\Pi_1 \cup \Pi_{crashed}$
crash right from the beginning of the run, $R_0$ and $R_A$ are indistinguishable to $q_0$.
Thus, $q_0$ decides 1 in $R_A$ (as it did in $R_0$), thereby violating the uniform validity
condition of Consensus.

2. In $R_1$, $q_1$ decides 0. This case is symmetric to Case 1.

3. In $R_0$, $q_0$ decides 0, and in $R_1$, $q_1$ decides 1. Construct $R_A = (F_A, H_A, I, S_A, T_A)$
as follows. As before, all processes in $\Pi_0$ propose 0, all processes in $\Pi_1 \cup \Pi_{crashed}$
propose 1, and all processes in $\Pi_{crashed}$ crash in $F_A$ at the beginning of the run. All
messages from processes in $\Pi_0$ to those in $\Pi_1$ and vice-versa, are delayed until time
$t_0 + t_1$. Until time $t_0$, $R_A$ is identical to $R_0$, except that the processes in $\Pi_1$ do not
crash, they are only “very slow” and do not take any steps before time $t_0$. Thus,
until time $t_0$, $q_0$ cannot distinguish between $R_0$ and $R_A$, and it decides 0 at time
$t_0$ in $R_A$ (as it did in $R_0$). Note that by time $t_0$, the failure detector $\mathcal{SF}(m)$ made
$n - f$ mistakes in $R_A$: $q_0$ erroneously suspected that all processes in $\Pi_1$ crashed
(while they were only slow).

From time $t_0$, the construction of $R_A$ continues as follows.

(a) At time $t_0$, all processes in $\Pi_0$, except $q_0$, crash in $F_A$.

(b) From time $t_0$ to time $t_0 + t_1$, $q_1$ suspects all processes in $\Pi_0 \cup \Pi_{crashed}$, i.e.,
$\forall t, t_0 \le t \le t_0 + t_1 : H_A(q_1, t) = \Pi_0 \cup \Pi_{crashed}$, and no other process suspects
any process. By suspecting all the processes in $\Pi_0$, including $q_0$, the failure
detector makes one mistake on process $q_1$ (about $q_0$). Thus, by time $t_0 + t_1$,
$\mathcal{SF}(m)$ has made a total of $(n - f) + 1$ mistakes in $R_A$. Since $m > n - f$,
$\mathcal{SF}(m)$ has made at most m mistakes in $R_A$ until time $t_0 + t_1$.

(c) At time $t_0$, processes in $\Pi_1$ “wake up.” From time $t_0$ to time $t_0 + t_1$ they
execute exactly as they did in $R_1$ from time 0 to time $t_1$ (they cannot perceive
this real-time shift of $t_0$). Thus, at time $t_0 + t_1$ in run $R_A$, $q_1$ decides 1 (as it
did at time $t_1$ in $R_1$). So $q_0$ and $q_1$ decide differently in $R_A$, and this violates
the agreement condition of Consensus.

(d) From time $t_0 + t_1$ onwards the run $R_A$ continues as follows. No more processes
crash and every correct process suspects exactly all the processes that have
crashed. Thus, $\mathcal{SF}(m)$ satisfies weak completeness, repentance, and makes
no further mistakes.

By (b) and (d), $\mathcal{SF}(m)$ satisfies the specification of a Strongly m-Mistaken failure
detector in run $R_A$. From (c), $R_A$, a run of A that uses $\mathcal{SF}(m)$, violates the
specification of Consensus. □

We now show that the above lower bound is tight: Given $\mathcal{SF}(m)$, Consensus can be
solved in asynchronous systems with $m \le n - f$.

Theorem 41: If $m \le n - f$ then Consensus can be solved in asynchronous systems
using any Strongly m-Mistaken failure detector $\mathcal{SF}(m)$.

Proof: Suppose $m < n - f$. Since m, the number of mistakes made by $\mathcal{SF}(m)$,
is less than the number of correct processes, there is at least one correct process that
$\mathcal{SF}(m)$ never suspects. Thus, $\mathcal{SF}(m)$ satisfies weak accuracy. By definition, $\mathcal{SF}(m)$
also satisfies weak completeness. So $\mathcal{SF}(m)$ is a Weak failure detector and can be used
to solve Consensus (Corollary 15).

Suppose $m = n - f$. Even though $\mathcal{SF}(m)$ can now make a mistake on every correct
process, it can still be used to solve Consensus (even if a majority of the processes are
faulty). The algorithm uses rotating coordinators, and is similar to the one for $\Diamond\mathcal{W}$ in
Figure 6. Because of this similarity, we omit the details from this Appendix. □

From  the  above  two  theorems: 

Corollary 42: Suppose $f \ge n/2$. Consensus can be solved in asynchronous systems using
any $\mathcal{SF}(m)$ if and only if $m \le n - f$.

We now turn our attention to Weakly k-Mistaken failure detectors.

Theorem 43: Suppose $f \ge n/2$. If $m > 0$ then there is a Weakly m-Mistaken failure
detector $\mathcal{WF}(m)$ such that there is no algorithm A which solves Consensus using $\mathcal{WF}(m)$
in asynchronous systems.

Proof: In Theorem 40, we described a failure detector that cannot be used to solve
Consensus in asynchronous systems with $f \ge n/2$. It is easy to verify that this failure
detector makes at most one mistake about each correct process, and thus it is a Weakly
m-Mistaken failure detector. □

From  Corollary  37  and  the  above  theorem,  we  have: 

Corollary 44: Suppose $f \ge n/2$. Consensus can be solved in asynchronous systems using
any $\mathcal{WF}(m)$ if and only if $m = 0$.
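
The bounds of this Appendix for the k-mistaken detectors can be collected in a small lookup, sketched below. The function name and the string tags "SF" and "WF" are hypothetical conventions introduced here; the function only restates Observation 39 and Corollaries 42 and 44 for the members of the hierarchy with a finite mistake bound m, and answers whether Consensus is solvable with every detector of the given kind.

```python
def consensus_solvable(kind: str, m: int, n: int, f: int) -> bool:
    """Is Consensus solvable with any SF(m) / WF(m) when at most f of n processes crash?

    kind is "SF" (Strongly m-Mistaken) or "WF" (Weakly m-Mistaken).
    """
    if kind not in ("SF", "WF"):
        raise ValueError("kind must be 'SF' or 'WF'")
    if m < 0 or not 0 <= f < n:
        raise ValueError("need m >= 0 and 0 <= f < n")
    if 2 * f < n:
        # Observation 39: a majority of correct processes always suffices.
        return True
    if kind == "SF":
        # Corollary 42: without a correct majority, need m <= n - f.
        return m <= n - f
    # Corollary 44: without a correct majority, need m = 0.
    return m == 0


# n = 5 processes, f = 3 possible crashes (no correct majority, n - f = 2).
sf_at_bound = consensus_solvable("SF", 2, 5, 3)
sf_over_bound = consensus_solvable("SF", 3, 5, 3)
wf_one_mistake = consensus_solvable("WF", 1, 5, 3)
```

With n = 5 and f = 3, any Strongly 2-Mistaken detector suffices, while three strong mistakes or a single weak mistake do not.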


